Information content of long streams of numbers

05-17-2012 , 05:35 PM
Imagine I gave you a random 20-digit number, say:

43958610982765353257

and asked you to represent it in a more compact manner, e.g. in the sense of, say,

30!+2^57=265252859812191202751496555855872

or 777654321^2 + 1=604746242969971042

etc.; you get the idea. I selected the number randomly, so it may be very hard to simplify, or it may be entirely lucky and see a dramatic reduction in length. If so, that was not intentional.
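Both example identities are easy to sanity-check; a throwaway Python sketch, nothing more than the arithmetic above:

Code:
from math import factorial

# Verify the two example representations given above.
assert factorial(30) + 2**57 == 265252859812191202751496555855872
assert 777654321**2 + 1 == 604746242969971042
print("both identities hold")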

Let's say we can use the basic symbols +-*/()^! and the digits 0-9, and maybe an extra separator-type character like | to signify the end of one stream representation with operations and the beginning of the next. That way we may play with 1 billion digits in packets of 20-30 or whatever seems convenient at the time.

Imagine we have massive computational power, plenty of time, and huge local memory (database capacity), but memory restrictions on the finished representation (or, equivalently, transmission-length restrictions).

So the objective is now this:


Start with a stream of very long, seemingly random integers. It can be in blocks of length 20 or length 100 or whatever, or the block length can even be variable depending on how we find patterns. Our objective is to transform them from plain 0-9 digit form (or effectively binary, later) to a functional-representation form that can be re-encoded again using 0-9.

Will we be able to compact the long stream of integers to something smaller with the above procedure?

Obviously some numbers are dramatically easy like 10^100 or 2^800 etc lol

Additionally we may try sums of cubes or squares, or products of powers of primes, +- corrective terms, etc.
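To make that concrete, here is a minimal Python sketch of such a pattern search, restricted to the single form a^b + c with a small corrective term (the cutoff of 10^4 is an arbitrary choice for illustration):

Code:
# Look for a representation n = a^b + c with a small corrective term c.
# Purely illustrative brute force; the cutoffs are arbitrary.
def short_form(n, max_corr=10**4):
    best = None
    b = 2
    while 2**b <= n:
        a = round(n ** (1.0 / b))          # base that lands closest to n
        for cand in (a - 1, a, a + 1):
            if cand < 2:
                continue
            c = n - cand**b
            if abs(c) <= max_corr:
                expr = "%d^%d%+d" % (cand, b, c)
                if best is None or len(expr) < len(best):
                    best = expr
        b += 1
    return best                            # None if nothing short was found

print(short_form(604746242969971042))      # -> 777654321^2+1
print(short_form(43958610982765353257))    # -> almost certainly None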

In any case, if the process works we may end up needing less memory to store the numbers (but unlike, say, JPEG compression, we lose nothing; it's easily reversible, just do the math lol).


Now take the resulting number stream, which is possibly significantly shorter than the original, and try to do the same thing one more time!!! LOL

Imagine some crazy effort where, after repeated such transformations, we end up with something ultra small that of course takes time to expand back to the original form, but with computers that's very fast, because the final encoded message contains all the instructions needed to rebuild the prior level of compression, and so on, until we reach the original form that is only numbers, with no symbols or operations.


Is there some universal limit to this process for random numbers? Or am I missing something important here?


Can it be that by introducing the extra symbols above, e.g. +-*/()^!|, creating in effect a ~19-symbol base system (feel free to find another useful symbol I missed and take it to base 20), whatever we save on the very good cases, like a 56^17 kind of representation (only 5 symbols in base 19 or 20 for something that takes 30 digits in base 10) or 788!, we eventually lose back, because we switched from base 10 to base ~20 and the majority of numbers are simply too complex to reduce to anything simple?
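The trade can be quantified with a back-of-the-envelope pigeonhole count (a quick sketch; the numbers are just log ratios): there are fewer than 19^(L+1) expressions of length at most L over a 19-symbol alphabet, so giving every n-digit number some expression forces L >= n*log(10)/log(19) - 1, about 0.78n.

Code:
import math

# Fewer than 19**(L+1) expressions of length <= L exist over a
# 19-symbol alphabet, so covering all 10**n n-digit numbers needs
# 19**(L+1) >= 10**n, i.e. L >= n*log(10)/log(19) - 1.
n = 20
L = math.ceil(n * math.log(10) / math.log(19) - 1)
print(L)  # -> 15: even a perfect scheme averages ~0.78 symbols per digit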


So basically I ask: what is the average information content of a long random integer???

Is it possible with repeated application of this scheme to arrive at some massive spectacular compression if we have enough computing power to search for patterns and simplify long streams in the manner described?
05-17-2012 , 05:54 PM
Quote:
Originally Posted by lastcardcharlie
owned.
05-17-2012 , 06:21 PM
Quote:
Originally Posted by jewbinson
owned.
I suggest you read the link (I had read it before and again today, and I will go over it further in time) and what I posted, and re-examine the concept of "owned". Of course I am somewhat familiar with Kolmogorov complexity, but I am looking for practical theorems about what I asked, not general terminology with little connection. So the link is of no direct help in answering what I asked. Maybe some theorem in it is, however, so I will check more.

If it is, then show me where in the link there are exact results that matter to the thread, and then I will ask lastcardcharlie why he didn't take the time to point to the exact theorems or other references in the first place???
05-17-2012 , 06:32 PM
Dude, chill. It was just coz he posted a 1-word answer that I wrote that. I was joking...
05-17-2012 , 06:54 PM
Quote:
Originally Posted by masque de Z
Of course i am somewhat familiar with Kolmogorov complexity...
Why of course? Is it such an important topic in physics?
05-17-2012 , 07:00 PM
Similar questions have been asked in the Usenet compression newsgroup since the beginning of time. Some "regulars" have tried to push their methods for decades!

AFAIK the bottom line is that for all numbers/files/messages there is no free lunch. Compression can only be achieved for certain strings.

As for a small set of big numbers: there are some well-known open compression challenges out there with particular sets of numbers deemed "random", and some big names have tried many approaches. If a Google search fails, try searching the compression newsgroups.

Last edited by Chasqui; 05-17-2012 at 07:11 PM.
05-17-2012 , 07:45 PM
Quote:
Originally Posted by lastcardcharlie
Why of course? Is it such an important topic in physics?
It might develop into one. Most certainly black hole entropy, Shannon entropy, etc. are not exactly unrelated issues.

Look here for instance to see some activity to be further investigated for fun:

http://inspirehep.net/search?ln=en&l...rm=&rg=25&sc=0

One clearly benefits from keeping an open attitude in modern physics.
05-17-2012 , 07:46 PM
Sounds like a data compression algorithm question. I believe it has been shown that data compression can never "reliably" compress data. I don't remember exactly how it works, but it's something like: if there is some string x1 that can be compressed by a data compression scheme, then there is some string y1 that will actually take up more space under that same scheme.

Or to put it another way, if you're using a string of 20 characters to store a number, then the total set of numbers you can store is only r^20, where r is the range of possible characters.

Using 20 characters of the " +-/*^(){}0123456789" range is going to be able to store more individual values than using 20 characters of the "0123456789" range, but beyond that you can't "buy" any new "slots": with the same character range, a sequence of n characters will hold r^n possible states that can map to no more than r^n referents.

It's just that n boxes can only hold n values; there's no wiggle room there. On the other hand, in things like file compression and image compression the data tends to follow certain patterns, so in practice you can gain a benefit by using a compression scheme that simplifies the most commonly found patterns. But in principle, because the number of boxes is finite, if you move a pattern from a "long" box into a "short" box, then you must also move a pattern from a "short" box into a "long" box.
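That counting argument can even be checked exhaustively for tiny cases; a toy sketch:

Code:
# There are strictly fewer short binary strings than strings of length n,
# so no lossless (injective) code can shrink every length-n input.
n = 8
shorter = sum(2**k for k in range(n))   # strings of length 0..n-1
print(shorter, 2**n)                    # 255 vs 256: something must not shrink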
05-17-2012 , 08:03 PM
Quote:
Originally Posted by masque de Z
Will we be able to compact the long stream of integers to something smaller with the above procedure?
In the average case, no. Almost all strings do not have patterns that can be exploited to compress them. The proof is very similar in concept to showing that almost all real numbers are transcendental.
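To put a number on "almost all" (a sketch using the standard counting argument): there are fewer than 2^(n-k) descriptions shorter than n-k bits, so at most a 2^(-k) fraction of n-bit strings can be compressed by k or more bits.

Code:
# Upper bound on the fraction of n-bit strings compressible by >= k bits.
# The bound 2**-k holds for every n: there aren't enough short descriptions.
for k in (1, 8, 32):
    print(k, 2.0**-k)   # at k=32, fewer than 1 string in 4 billion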
05-17-2012 , 08:32 PM
Quote:
Originally Posted by madnak
Sounds like a data compression algorithm question. I believe it has been shown that data compression can never "reliably" compress data. I don't remember exactly how it works, but it's something like: if there is some string x1 that can be compressed by a data compression scheme, then there is some string y1 that will actually take up more space under that same scheme.

Or to put it another way, if you're using a string of 20 characters to store a number, then the total set of numbers you can store is only r^20, where r is the range of possible characters.

Using 20 characters of the " +-/*^(){}0123456789" range is going to be able to store more individual values than using 20 characters of the "0123456789" range, but beyond that you can't "buy" any new "slots": with the same character range, a sequence of n characters will hold r^n possible states that can map to no more than r^n referents.

It's just that n boxes can only hold n values; there's no wiggle room there. On the other hand, in things like file compression and image compression the data tends to follow certain patterns, so in practice you can gain a benefit by using a compression scheme that simplifies the most commonly found patterns. But in principle, because the number of boxes is finite, if you move a pattern from a "long" box into a "short" box, then you must also move a pattern from a "short" box into a "long" box.
I had a feeling this was the case, but couldn't prove it. Proof seems obv to me now, ty for that
05-17-2012 , 09:11 PM
Quote:
Originally Posted by dessin d'enfant
In the average case, no. Almost all strings do not have patterns that can be exploited to compress them. The proof is very similar in concept to showing that almost all real numbers are transcendental.
Can you link to such a proof? Although I intuitively see what you mean, and was afraid of it as my original post hinted, the connection with polynomials and rational coefficients is not exactly identical (though suggestive anyway) to the "space" of all possible operations I proposed, including powers, factorials, primes, sums of powers, products, etc.

Last edited by masque de Z; 05-17-2012 at 09:16 PM.
05-17-2012 , 10:40 PM
Quote:
Originally Posted by masque de Z
Can you link to such a proof? Although I intuitively see what you mean, and was afraid of it as my original post hinted, the connection with polynomials and rational coefficients is not exactly identical (though suggestive anyway) to the "space" of all possible operations I proposed, including powers, factorials, primes, sums of powers, products, etc.
See this paper by Shannon.

For a friendly intro, see the wiki article on Shannon entropy.



I think the first link I gave is behind a paywall....a free version should be here.

Last edited by dessin d'enfant; 05-17-2012 at 10:56 PM.
05-17-2012 , 10:54 PM
Quote:
Originally Posted by dessin d'enfant
See this paper by Shannon.

For a friendly intro, see wiki article on Shannon Entropy
Is this the same 1948 paper?

http://www.alcatel-lucent.com/bstj/v...tj27-3-379.pdf
05-17-2012 , 11:10 PM
Quote:
Originally Posted by masque de Z
Is this the same 1948 paper?
Yeah sorry....It's Shannon's 1948 paper "A Mathematical Theory of Communication"....I didn't realize initially that it was behind a paywall. My edited link (like the one you provided) gives a free version.
05-17-2012 , 11:19 PM
I was familiar with Shannon entropy from physics, and I have seen that paper before but never read it in complete detail. Are you sure it contains a proof that a random sequence of digits cannot, on average, be compressed below its initial form, even if you have vast computational power and the ability to search for all conceivable patterns? I am not exactly sure how they deal with the concept of all potential ideas for representing large integers. When I was briefly going over these links I never found anything very close to what I was saying; maybe what I was saying can be converted into a formalism that fits these papers better, but frankly I didn't see the connection the way I wanted it. If you are sure it's in there I will go over it, but if you know where exactly, I would appreciate it.
05-18-2012 , 12:48 AM
Somewhat on topic, I was thinking of the following method for compressing things:

Start your sequence with a binary digit, 1 or 0.

If it's 0, the data is not compressed and is sent just as-is.

If it's 1, the following n alphabetical characters say which algorithm we are using, so we have 26^n different possible algorithms.

So for example, say we have a sequence like the first 100 digits of pi:

31415926....

We could send this as either uncompressed:

031415926....

or compressed (let's use n=3, which is known to the recipient in advance):

1rar72245366....

So in this case the recipient knows we are using the 'rar' algorithm, and he can decode 72245366 back into the original 31415926.

This approach would give us access to 26^n possible compression algorithms, at the cost of a single binary digit in front of the data. A pretty good deal imo.

In real life data is usually not purely random, so this method is good imo

But if the sequence is purely random, I believe that on average we are either no better off or actually worse off using this method.

My question is: using the above method applied to random sequences, can we pick the right set of 26^n algorithms so that on average we are just as efficient as just sending the data? I am guessing yes but I have no proof.

If yes, can anyone find an example of a set of compression algorithms that will achieve this?

Of course we could just have a set of 2 algorithms, with n=1 and using just the letters a and b.

Or even just 1 algorithm, with n = 0.
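A rough Python sketch of this framing (the one-entry algorithm table is made up for illustration; "rev", which just reverses the digits, stands in for a real codec like rar):

Code:
# Flag scheme sketch: n = 3, table known to both sides in advance.
# "rev" is a hypothetical placeholder codec; it never actually shrinks
# anything, so the encoder below always falls back to the raw branch.
ALGOS = {"rev": (lambda s: s[::-1], lambda s: s[::-1])}

def encode(data):
    best = "0" + data                        # uncompressed fallback
    for name, (comp, _) in ALGOS.items():
        candidate = "1" + name + comp(data)  # flag + 3-letter name + payload
        if len(candidate) < len(best):
            best = candidate
    return best

def decode(msg):
    if msg[0] == "0":
        return msg[1:]
    name, payload = msg[1:4], msg[4:]
    return ALGOS[name][1](payload)

print(decode(encode("31415926")))            # round-trips to 31415926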
05-18-2012 , 12:54 AM
I think I just answered my own question.

One algorithm:

If the flag is 1, then the first digit in the sequence is repeated at least once, so we drop one copy of the first digit.

so for example:

the sequence 876543 will not be compressed, we send it as:

0876543

but the sequence 886543 will be compressed, we send it as:

186543
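In runnable form (a quick sketch), with the bookkeeping made explicit: a string whose first digit repeats only breaks even, because the saved digit pays for the flag, while every other string grows by one character. That is exactly the long-box/short-box trade described earlier.

Code:
def encode(s):
    # Flag "1": first digit repeats, drop one copy (net: same length).
    # Flag "0": send as-is (net: one character longer).
    if len(s) >= 2 and s[0] == s[1]:
        return "1" + s[1:]
    return "0" + s

def decode(msg):
    return msg[1] + msg[1:] if msg[0] == "1" else msg[1:]

print(encode("876543"))          # 0876543 (7 chars for 6)
print(encode("886543"))          # 186543  (6 chars for 6)
print(decode(encode("886543")))  # 886543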
05-18-2012 , 02:33 AM
Quote:
Originally Posted by masque de Z
I was familiar with Shannon entropy from physics, and I have seen that paper before but never read it in complete detail.
I only read it through once when I was an undergrad, but I did read it line for line....I would definitely recommend it if you have a few hours to kill. It's a very clear and easy-to-understand paper even if you aren't an expert in the subject (and I say that as a non-expert).

Quote:
Are you sure it contains a proof that a random sequence of digits cannot, on average, be compressed below its initial form
Yes...it really just follows what madnak said....which is just a restatement of the pigeonhole principle.

Quote:
even if you have vast computational power and the ability to search for all conceivable patterns?
For this part we have to define how vast "vast" is. If you have so much computational power that you effectively have an EXPSPACE oracle, you can compress very, very large solution spaces to just 1 symbol without violating the pigeonhole principle. But obv that's cheating wrt the question you asked.

Quote:
I am not exactly sure how they deal with the concept of all potential ideas for representing large integers
Luckily, we don't have to worry about that except in the most trivial way. Asymptotically it's obvious, since you can't count the reals. But it turns out finite questions can be proved essentially the same way. Just define your compression space to be the strings of length at most X over an alphabet of Y symbols; it has fewer than Y^(X+1) members, so there is no injection into it from any set with more members than that. Basically, once you rigorously define how many symbols you can use and how large your alphabet is, I can tell you the exact length beyond which strings can no longer be compressed on average, regardless of how clever you are at compressing them.
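For concreteness, a small sketch of that count, using the thread's 20-digit blocks and 19-symbol alphabet as the inputs:

Code:
def strings_up_to(Y, L):
    # Number of strings of length 0..L over a Y-symbol alphabet.
    return (Y**(L + 1) - 1) // (Y - 1)

# Smallest max-length L at which 19-symbol expressions could even in
# principle name all 10**20 twenty-digit inputs:
L = 0
while strings_up_to(19, L) < 10**20:
    L += 1
print(L)  # -> 16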
05-20-2012 , 12:52 AM
Quote:
Originally Posted by Clue
Somewhat on topic I was thinking of the following method for compressing things:

Start your sequence with a binary digit, 1 or 0.

If it's 0, the data is not compressed and is sent just as-is.

If it's 1, the following n alphabetical characters say which algorithm we are using, so we have 26^n different possible algorithms.
You are adding one character to all messages, plus the n characters to all compressed ones. You'll have to keep very careful track of everything you add.

Quote:
Originally Posted by Clue
This approach would give us access to 26^n possible compression algorithms, at the cost of a single binary digit in front of the data. A pretty good deal imo.
You forget the n digits for the compressed strings.

Quote:
Originally Posted by Clue

In real life data is usually not purely random, so this method is good imo

But if the sequence is purely random, I believe that on average we are either no better off or actually worse off using this method.
Your method will work well on some sets of data. But you'll always be equal or worse off on average.
Quote:
Originally Posted by Clue

My question is: using the above method applied to random sequences, can we pick the right set of 26^n algorithms so that on average we are just as efficient as just sending the data? I am guessing yes but I have no proof.
The answer is: not better. Start with the Shannon paper and the Kolmogorov links posted ITT. Be warned that it can get weird down this rabbit hole.
05-20-2012 , 01:16 AM
Quote:
Originally Posted by Clue
I think I just answered my own question.

One algorithm:

If the flag is 1, then the first digit in the sequence is repeated at least once, so we drop one copy of the first digit.

so for example:

the sequence 876543 will not be compressed, we send it as:

0876543

but the sequence 886543 will be compressed, we send it as:

186543
You are doing digit substitution: you transmit the same amount, but you have overhead in the decompressor (if first_digit = 1 then ignore_and_repeat_second_digit).

The decompressor had to be transmitted at least once, right? Maybe the compressor zips better with the new lines of code in it? Food for thought.
05-20-2012 , 01:45 AM
Quote:
Originally Posted by Chasqui
Your method will work well on some sets of data. But you'll always be equal or worse off on average.
Yeah....this is the key point. Since OP was talking about random strings, it is simply impossible to compress them beyond their Shannon Entropy, and that's a theorem.

Of course the HUGE 'loophole' is that people don't care about compressing random strings....they care about compressing Shakespeare plays, Da Vinci paintings, Beatles songs, Kubrick movies, etc. But if you achieve a good level of compression for those things, whatever method you use must necessarily increase the storage needed for other (perhaps random) data. It's very analogous to how you can turn your AC on and decrease the entropy in your house, but, because of the 2nd law, there is some superset of the house in space in which the total entropy has increased.
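In numbers (a quick sketch): a uniformly random decimal digit carries log2(10) bits of Shannon entropy, so a random 20-digit block costs about 66.4 bits on average no matter how clever the encoder is.

Code:
import math

bits_per_digit = math.log2(10)   # ~3.3219 bits per uniformly random digit
print(bits_per_digit)
print(20 * bits_per_digit)       # ~66.44 bits: the average-case floor
                                 # for any lossless code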
05-20-2012 , 02:35 AM
Quote:
Originally Posted by Chasqui
You forget the n digits for the compressed strings.
What I meant was: if we do choose to use an algorithm, yes, there are n extra digits, but it's worth it for that particular algorithm.

But if there is no good algorithm, then we've only added a single binary digit, so we are only "paying" a single binary digit in the worst-case scenario.
05-25-2012 , 11:13 AM
Quote:
Originally Posted by jewbinson
Dude, chill. It was just coz he posted a 1-word answer that I wrote that. I was joking...
It's not a very good joke, considering that every single person replying with substance is going to have heard of Kolmogorov complexity.
05-25-2012 , 11:33 AM
So, this is a finite problem; obviously, if we are dealing with infinite strings then there is no way to compress.

I think with finite strings the issue is the same as people said. Basically you can think of a writing (e.g. 10^300 is a writing) as just trying to encode sequences over (0,1,...,9) in the alphabet (0,1,2,3,4,5,6,7,8,9,+,-,*,^). Then we are looking at, e.g., 13^80 vs 10^100. So I'm not sure compression is so easy, since 13^80 is only about 10^89. For blocks of 20, 13^15 ≈ 10^16, so it's the same problem.
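The digit counts behind that comparison are quick to check (a throwaway sketch):

Code:
print(len(str(13**80)))    # 90  -- "13^80" (5 symbols) names a 90-digit number
print(len(str(10**100)))   # 101
print(len(str(13**15)))    # 17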