
>As for hash tables, can't you guarantee uniqueness by using only reversible operations?

Wouldn't that make the output almost as long as the input, defeating the purpose?



actually, just as long.

Any string of bits is a valid input, so the input carries one bit of entropy per bit. If the hash were not "just as long", then one of the following would have to be true:

- it collides (two possible inputs have the same hash)

- some inputs cannot be hashed

- the algorithm encodes more than one bit of entropy per bit

None of these is possible, by the pigeonhole principle. http://en.wikipedia.org/wiki/Pigeonhole_principle

There's probably a more rigorous proof here, but I'm lazy, so here's a sketch. Say the input is n bits (e.g. 16). Make an array of length 2^n, initialized to -1. For each possible input x in 0..2^n-1, hash it; since the hash is no longer than the input, you can interpret the hash as an index into the array. Set array[hash(x)] = x if that slot is still -1, otherwise abort with a collision. If you get through all 2^n inputs without aborting, you have filled 2^n slots. Is it possible to have avoided a collision? Yes, easily: for example, everything hashes to the next integer, except binary all 1's, which hashes to all 0's. Could some of the slots be empty? No: if you put 2^n values into 2^n holes with no collisions, there is exactly one value per hole. Conversely, if you had anything LESS than that many holes (for example, you're trying to hash 32 bits of input uniquely into only 16 bits of output), then by the pigeonhole principle you have to put two inputs into the same index.
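The array argument above can be sketched in a few lines of Python (the function and variable names are mine, not from the thread): fit every n-bit input into 2^m output slots and watch for a slot being claimed twice.

```python
def check_for_collision(hash_fn, n_bits, out_bits):
    """Hash every n-bit input; return True if any two inputs share a hash."""
    seen = [-1] * (2 ** out_bits)          # one slot per possible output
    for x in range(2 ** n_bits):
        h = hash_fn(x) % (2 ** out_bits)   # interpret the hash as an index
        if seen[h] != -1:
            return True                    # collision: slot already claimed
        seen[h] = x
    return False

# The "next integer" example from above: a bijective hash on 8-bit inputs.
succ = lambda x: (x + 1) % 2 ** 8

# Output as long as the input: no collision, every slot gets filled.
assert check_for_collision(succ, n_bits=8, out_bits=8) is False

# Output even one bit shorter: a collision is forced by pigeonhole.
assert check_for_collision(succ, n_bits=8, out_bits=7) is True
```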

So any hash that is unique over all possible inputs has to be exactly as long as what it's hashing. If there are any outputs that are never hashed to, it would have to be even longer.

Note that this proof depends on iterating on every possible input. If some inputs aren't possible, the hash could be unique while being shorter than the input. This is how lossless compression works.


>actually, just as long

Well, more accurately, at least as long, on average.

It would be possible, though, to have a guaranteed unique string which is usually shorter than the input for real world data, which is called lossless compression. Obviously, it is not possible to provide a fixed digest length though, which makes it useless for many hash function applications, like bloom filters. And, of course, a hash table implementation based on DEFLATE would be hilariously inefficient.
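A quick illustration of the point about lossless compression (a sketch using Python's standard zlib, which wraps DEFLATE): the encoding is uniquely reversible, and for redundant real-world data the output is shorter than the input, but its length varies rather than being a fixed digest size.

```python
import zlib

# Redundant "real world" input: repeated English text.
text = b"the quick brown fox jumps over the lazy dog " * 50
packed = zlib.compress(text)

assert zlib.decompress(packed) == text   # lossless: uniquely reversible
assert len(packed) < len(text)           # shorter, for this redundant input
```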


Fair enough. The point is that you can't shave even a single bit off every output: you can't take an arbitrary 1024-byte file and hash it uniquely to 1023 bytes. Not possible, unless you're hiding that entropy somewhere else (the filename, etc.).


The proof is straightforward application of the pigeonhole principle. Input strings are your pigeons, hashes are your holes.


What you've just said says nothing about the length of the hashes, which was my point. You would have to say something like "all possible inputs of a certain length are your pigeons, and those same possibilities(1) are the holes".

(1)this time interpreted as the hash of one and only one of the same list of possible inputs

The proof is not so straightforward, because it's not always true when you expect it to be true. For example, can you hash, reversibly and uniquely, all the real numbers between 0 and 10 to all the real numbers between 0 and 1?

You would expect the answer to be "no" since there are "more real numbers from 0 to 10 than from 0 to 1". But that's actually not necessarily true the way you might expect, and the answer is actually "yes you can". Simply geometrically establish a coordinate plane like this

   a

   0 b 1

   0       10
so that point (a) has a well-defined coordinate, say (0,1); the 0 on the second line is at the origin; the 1 on the second line is at (1,0); and the bottom 0 is at (0,-1), with the bottom 10 placed wherever the ray from a through (1,0) - enough to uniquely identify the ray - reaches the y-value of -1.

Now, to get the hash, simply move 'b' to whichever point between 0 and 1 you're trying to hash, find the ray through a and that point, and solve for the x-value where that ray reaches the y-value of -1.

By an elementary argument, any b casts a distinct ray (and the map is easily reversible by introducing a c on line 3), and you can solve for it explicitly if you like.
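Here's my reading of the construction above as code (the coordinates follow the diagram; the label scale on the bottom line is my inference: the geometric point where the ray through a and (1,0) meets y = -1 is x = 2, and that point carries the label 10, so label = 5 * geometric x):

```python
A = (0.0, 1.0)  # the apex point 'a' from the diagram

def project(b):
    """Ray from A through (b, 0), extended to y = -1, read off as a 0..10 label."""
    # Parametrize the ray as (t*b, 1 - t); it reaches y = -1 at t = 2.
    x_geometric = 2 * b
    return 5 * x_geometric        # convert geometric x to the 0..10 label

def unproject(label):
    """Reverse the ray: recover b in [0, 1] from a label in [0, 10]."""
    x_geometric = label / 5
    return x_geometric / 2

# Since the two lines are parallel, the whole map collapses to b -> 10*b,
# which is visibly a bijection between [0, 1] and [0, 10].
assert abs(project(0.37) - 3.7) < 1e-12
assert abs(unproject(project(0.37)) - 0.37) < 1e-12
```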

I'm not saying that you're wrong, but I don't think the proof is as "obvious" and straightforward as you're saying. It doesn't apply to just anything, only to the specific case: the impossibility of uniquely hashing inputs of a given length into a discrete space defined by a uniformly shorter number of bits.


From wikipedia, "A hash function is any algorithm or subroutine that maps large data sets of variable length, called keys, to smaller data sets of a fixed length."

Given the pigeonhole principle, and the observation that there are fewer strings of length K than strings of arbitrary length, you get that you cannot map every string to a unique hash. That's what I meant by "a straightforward application of the pigeonhole principle".

If you're dealing with an infinite set of hashes (I'm not sure what that would look like, but hey), then of course you need to take into account the limitations of the pigeonhole principle when dealing with infinite sets - specifically, that it only applies if you have a set of pigeons of a larger cardinality than your set of holes. [0,1] and [0,10] have the same cardinality.


>You would expect the answer to be "no" since there are "more real numbers from 0 to 10 than from 0 to 1"

Well, the quotes you had to put there say it all. You might expect the answer to be "no" if you didn't understand how infinity works.


I was indeed thinking of lossless compression: in practice it compresses almost all of its input. If the output length doesn't have to be constant, it's fine that some inputs actually get inflated, as long as the average input compresses.

But in this case the output length might have to be constant? If that's the case, you are of course correct.


Lossless compression doesn't compress random data. You will still have the same number of bits in and out.
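This is easy to demonstrate (a sketch using Python's zlib with a fixed seed for reproducibility): pseudo-random bytes carry no redundancy for the coder to exploit, so DEFLATE actually adds a few bytes of overhead rather than shrinking the input.

```python
import random
import zlib

random.seed(0)  # fixed seed so the demo is deterministic
noise = bytes(random.getrandbits(8) for _ in range(1024))

packed = zlib.compress(noise)
assert len(packed) >= len(noise)          # no savings on random input
assert zlib.decompress(packed) == noise   # still lossless, of course
```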


Who says the input is necessarily random? ;)



