|
|
| Science, Math, and Philosophy Discussions regarding science, math, and/or philosophy. |
05-17-2012, 11:19 PM
|
#16
|
|
veteran
Join Date: Aug 2009
Location: Stanford, CA USA
Posts: 3,398
|
Re: Information content of long streams of numbers
I was familiar from Physics with Shannon entropy and i have seen that paper before but never read it in complete detail. Are you sure in it he has a proof that a random sequence of digits cannot be compressed on average more than its initial form even if you have vast computational power and ability to search for all conceivable patterns? I am not exactly sure how they deal with the concept of all potential ideas for representing large integers. When i was briefly going over these links i never found anything very close to what i was saying or maybe what was saying can be converted in a formalism that fits better these papers but i didnt frankly see the connection as i wanted it to be. If you are sure its in there i will go over it but if you know where exactly i would appreciate it.
|
|
|
05-18-2012, 12:48 AM
|
#17
|
|
centurion
Join Date: May 2012
Posts: 172
|
Re: Information content of long streams of numbers
Somewhat on topic I was thinking of the following method for compressing things:
Start your sequence with a binary number, 1 or 0.
If it's 0, it is not compressed, data is just as is
if it's 1, the following n alphabetical characters describe what algorithm we are using, therefore we have n^26 different possible algorithms
So for example say we have the sequence of like the first 100 digits of Pi:
31415926....
We could send this as either uncompressed:
031415926....
or compressed, (let's use n=3, which is known to the recipient in advance):
1rar72245366....
so in this case the recipient knows we are using the 'rar' algorithm, so he can decode the 72245366 into the original 31415926
This approach would give us access to an infinite number of compression algorithms, at the cost of a single binary digit in front of the data. A pretty good deal imo.
In real life data is usually not purely random, so this method is good imo
But if the sequence is purely random, I believe that on average we are no better off, or worse off, using this method.
My question is, using the above method applied to random sequences, can we pick the right set of 26^n algorithms so that we are on average just as efficient compared to just sending the data? I am guessing yes but I have no proof.
If yes, can anyone find an example of a set of compression algorithms that will achieve this?
Of course we can just have a set of 2 algorithms having n=1 and using letters just a and b
Or even just 1 algorithm, and n = 0
|
|
|
05-18-2012, 12:54 AM
|
#18
|
|
centurion
Join Date: May 2012
Posts: 172
|
Re: Information content of long streams of numbers
I think I just answered my own question.
One algorithm:
if 1, then the first number in the sequence is repeated at least once, so we will drop the first number
so for example:
the sequence 876543 will not be compressed, we send it as:
0876543
but the sequence 886543 will be compressed, we send it as:
186543
|
|
|
05-18-2012, 02:33 AM
|
#19
|
|
old hand
Join Date: May 2012
Posts: 1,536
|
Re: Information content of long streams of numbers
Quote:
Originally Posted by masque de Z
I was familiar from Physics with Shannon entropy and i have seen that paper before but never read it in complete detail.
|
I only read it through once when i was an undergrad but i did read it line for line....I would definitely recommend it if you have a few hours to kill. It's a very clear and easy to understand paper even if you aren't an expert in subject (and I say that as a non expert)
Quote:
|
Are you sure in it he has a proof that a random sequence of digits cannot be compressed on average more than its initial form
|
Yes...it's really just following what Madnak said....which is just a restatement of the pigeon hole principle.
Quote:
|
even if you have vast computational power and ability to search for all conceivable patterns?
|
For this part we have to define how vast "vast" is  . If you have so much computational power to where you effectively have an ExpSpace oracle you can compress very, very large solution spaces to a just 1 symbol without violating the pigeon hole principle. But obv that's cheating wrt the question you asked.
Quote:
|
I am not exactly sure how they deal with the concept of all potential ideas for representing large integers
|
Luckily, we don't have to worry about that except in the most trivial way  . Asymptotically, it's obvious since you can't count the reals. But it turns out finite questions can be proved essentially the same way. Just define your compression space to have X members with some alphabet of Y symbols. There exists no bijection from any space with more than X^Y members to this compression space. Basically, once you rigorously define how many symbols you can use and how large your alphabet is I can tell you the exact average length of string that you can no longer compress regardless of how clever you are at compressing strings.
|
|
|
05-20-2012, 12:52 AM
|
#20
|
|
adept
Join Date: Sep 2009
Location: location ,location.
Posts: 1,086
|
Quote:
Originally Posted by Clue
Somewhat on topic I was thinking of the following method for compressing things:
Start your sequence with a binary number, 1 or 0.
If it's 0, it is not compressed, data is just as is
if it's 1, the following n alphabetical characters describe what algorithm we are using, therefore we have n^26 different possible algorithms
|
You are adding one character to all messages plus the n character to all compressed ones. You'll have to keep track of all you add very carefully.
Quote:
Originally Posted by Clue
This approach would give us access to an infinite number of compression algorithms, at the cost of a single binary digit in front of the data. A pretty good deal imo.
|
You forget the n digits for the compressed strings.
Quote:
Originally Posted by Clue
In real life data is usually not purely random, so this method is good imo
But if the sequence is purely random, I believe that on average we are no better off, or worse off, using this method.
|
Your method will work well on some sets of data. But you'll always be equal or worse off on average.
Quote:
Originally Posted by Clue
My question is, using the above method applied to random sequences, can we pick the right set of 26^n algorithms so that we are on average just as efficient compared to just sending the data? I am guessing yes but I have no proof.
|
The answer is not better. Start with the Shannon paper and the Kolmogorov links posted itt. Be warned that can get weird down this rabbit hole.
|
|
|
05-20-2012, 01:16 AM
|
#21
|
|
adept
Join Date: Sep 2009
Location: location ,location.
Posts: 1,086
|
Quote:
Originally Posted by Clue
I think I just answered my own question.
One algorithm:
if 1, then the first number in the sequence is repeated at least once, so we will drop the first number
so for example:
the sequence 876543 will not be compressed, we send it as:
0876543
but the sequence 886543 will be compressed, we send it as:
186543
|
You are doing digit substitution, you transmit the same amount but have overhead in the decompressor (if first_digit=1 then ignore_and_repeat_second_digit).
The decompressor had to be transmitted at least once right?. Maybe the compressor zips better with the new lines of code in it?. Food for thought.
|
|
|
05-20-2012, 01:45 AM
|
#22
|
|
old hand
Join Date: May 2012
Posts: 1,536
|
Re: Information content of long streams of numbers
Quote:
Originally Posted by Chasqui
Your method will work well on some sets of data. But you'll always be equal or worse off on average.
|
Yeah....this is the key point. Since OP was talking about random strings, it is simply impossible to compress them beyond their Shannon Entropy, and that's a theorem.
Of course the HUGE 'loophole' is that people don't care about compressing random strings....they care about compressing Shakespeare plays, Da Vinci paintings, Beatles song, Kubrick movies etc. But if you can achieve a good level of compression for those things whatever method you use must necessarily increase the storage limit of other (perhaps random) data. It's very analogous to how you can turn on your AC on and decrease the entropy in your house but, because of the 2nd law, there exists some superset in space of the house in which the total entropy has increased.
|
|
|
05-20-2012, 02:35 AM
|
#23
|
|
centurion
Join Date: May 2012
Posts: 172
|
Re: Information content of long streams of numbers
Quote:
Originally Posted by Chasqui
You forget the n digits for the compressed strings.
|
What I meant was, if we do choose to use an algorithm, yes there are n extra digits but it's worth it for some particular algorithm.
But if there is no good algorithm, then we've only added a single binary digit, so we are only "paying" a single binary digit worst case scenario
|
|
|
05-25-2012, 11:13 AM
|
#24
|
|
aka Double Ice
Join Date: Jun 2007
Location: Twitter
Posts: 4,568
|
Re: Information content of long streams of numbers
Quote:
Originally Posted by jewbinson
Dude, chill. It was just coz he posted a 1-word answer that I wrote that. I was joking...
|
Its not a very good joke considering that every single person replying of substance is going to have heard of kolmogorov complexity
|
|
|
05-25-2012, 11:33 AM
|
#25
|
|
aka Double Ice
Join Date: Jun 2007
Location: Twitter
Posts: 4,568
|
Re: Information content of long streams of numbers
So, this is a finite problem, obviously if we are dealing with infinite strings then there is no way to compress.
I think with finite strings the issue is the same as people said. Basically you can think of a writing (eg. 10^300 is a writing) as just trying to encode sequences in (0,1,....,9) in the alphabet (0,1,2,3,4,5,6,7,8,9,+,-,*,^). Then we are looking at eg. 13^80 vs 10^100, for example. So I'm not sure compression is so easy, since 13^80 is about 10^89. For blocks of 20, 13^15 = 10^16 or so, so its the same problem.
|
|
|
06-01-2012, 09:51 PM
|
#26
|
|
newbie
Join Date: Nov 2009
Posts: 40
|
Re: Information content of long streams of numbers
Quote:
Originally Posted by masque de Z
Will we be able to compact the long stream of integers to something smaller with the above procedure?
Obviously some numbers are dramatically easy like 10^100 or 2^800 etc lol
Additionally we may try sums of cubes or squares or products of powers of primes+- corrective terms etc
|
i echo what's been said about the information theory stuff, but to be a little more illustrative about the random/average case using the described operation reduction: between almost any two pretty representations surrounding a random point, there's a gulf of numbers whose "similar" representation is nearly as complex. that plus/minus corrective term doesn't have very many fewer digits than the original.
for example, if we pretend 10^100 weren't so neat, the approximate
Quote:
|
2*56^57 + 5*57^56 + 56^56 + 2*55^56 + 2*56^55 + 54^55 + 55^54
|
differs from 10^100 by a 98-digit number, and continuing in this fashion quickly burns through our 100-character gas tank.
if you're going to give me a "huge" amount of some resource, though, obviously i can compress the universe into a look-up table the size of the universe and so on and so forth ( http://en.wikipedia.org/wiki/Space%E...3time_tradeoff ).
|
|
|
06-02-2012, 09:28 AM
|
#27
|
|
Carpal \'Tunnel
Join Date: Jun 2006
Location: 39, 46, 56, 59, 191
Posts: 39,984
|
Re: Information content of long streams of numbers
Quote:
|
Sounds like a data compression algorithm question. I believe it has been shown that data compression can never "reliably" compress data, I don't remember exactly how it works but it's something like if there is some string x1 that can be compressed by a data compression scheme then there is some y1 that will actually take up more data under that same scheme.
|
Pretty much the same argument as in "no free lunch in search and optimization". Cliffnotes: nitty definitions, pick domain specific algorithms, don't look for an ultimate, solve all algorithm.
|
|
|
06-02-2012, 02:50 PM
|
#28
|
|
veteran
Join Date: Aug 2009
Location: Stanford, CA USA
Posts: 3,398
|
Re: Information content of long streams of numbers
It must be evident by now that what i say can only appear possible in situations that a text like sequence that tends to involve also streams of numbers but not only is considered. If all you have to convert is just numbers and you use say 0-9 you can always convert that to a 20-base say system (or even 64 ascii type base whatever ) that involves 0-9!+-*/^()|p etc type of thing. (p= prime number say)
Now here is the easy to see problem. If we have a stream length n2 in the new system base 20 or whatever the number of unique elements that such stream can describe is obviously something like 20^n2. The collection of all possible symbolic math expressions that use 0-9!+-*/^()|p as symbols to produce representations of numbers is clearly a subset with much less than 20^n2 members!!! Why? Well because the original 0-9!+-*/^()|p 20 base set will include sequences like ()+^!*!^! etc that clearly make absolutely no sense mathematically (not permitted). Additionally many of such expressions will corresond to the same integer eg 2^1024 and 16^256 both mean the same. Also every expression of particular order of symbols and given length must correspond to a unique integer anyway (no more than 1). So the available possible expressions using symbolic forms are definitely much less than 20^n2. Although it is possible to represent using streams of length n2 extremely big integers that are far greater than 20^n2 the number of elements of all possible representations will be massively less than 20^n2 and this instantly makes the symbolic attempt miserably bad compared to the stupidly trivial alternative of expressing the 0-9 streams of digits of a decimal system representation to a new base 20 system of characters of 0-9!+-*/^()|p which of course we know cannot shrink anything and it has by 1-1 arguments the same number of members as the decimal stream has.
So basically any symbolic attempt to describe a number will be always a subset of all possible 20-base expressions using these symbolic characters and as such it will be wasting rather than saving space eventually compared to the alternative direct base 20 representation that isnt using anything other than the trivial polynomial base 20 representation.
So whatever i could possibly had in mind intuitively in that thread cannot be naively seen as along the lines of what i described above. Clearly this is not an issue of arguing one can save data by moving from one base to the subset of another...!
One can still however ask other related questions that may also prove trivial (eg equivalent) or not such as if you have an undefined length of decimal digits what is the probability (you start and add each additional random digit to produce a higher number and test it then one more etc until the number becomes easy to represent!) that eventually it will become at some large enough number of digits an easily expressed term using say 0-9!+-*/^()|p that will be more compact than the base 20 polynomial representation of the same number.
Other problems also appear like what is the information content of human speech. For example if we use ascii characters our texts will tend to have say in English 5 size streams like "angle" but not "lagen" or "lgnae" or "glnea" etc you see what i mean. So converting human text to more compact expressions using this observation might be substantially lucrative.
To be continued. Thanks all for posting.
|
|
|
06-02-2012, 06:04 PM
|
#29
|
|
adept
Join Date: Sep 2009
Location: location ,location.
Posts: 1,086
|
Quote:
Originally Posted by masque de Z
So converting human text to more compact expressions using this observation might be substantially lucrative.
|
It is, that's why text files compress so well.
|
|
|
| Thread Tools |
|
|
| Display Modes |
Linear Mode
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
All times are GMT -4. The time now is 01:57 AM.
|