
It makes me chuckle when hash maps are stated to have O(1) insertions. Which is true with respect to the number of items in the map, assuming the map doesn't need resizing and there isn't a hash collision... but it's generally not true with respect to the key length. (I think most implementations are O(l·n) in the worst case, where l is the length of the key and n is the number of inserted items, assuming the hash function is O(l) - the _amortised_ runtime would be O(l).)
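To make the O(l) factor concrete, here is a sketch of the polynomial hash that java.lang.String.hashCode() computes; it touches every character of the key once, so a single hash is O(l) in the key length:

    // O(l): walks every character of the key once
    static int stringHash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    }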


> assuming the map doesn't need resizing

This isn't a big difficulty; it's still amortized O(1).

> and there isn't a hash collision

This is a real difficulty, unless you allow map resizing. Luckily, we do.

> but it's generally not true in respect to the key length.

OK, but in most cases the key length is constant, making anything that depends on the key length O(1) by definition.


I wrote my own version of a part of a very popular Java scientific tool, and my version runs about 50 times faster. Their mistake? They had a hashCode() implementation on the objects they were using as keys for HashMaps that iterated through all of the voluminous content of that object. And there was no point - they could have used IdentityHashMaps instead with the same result. I pointed this out to them, and they still haven't fixed it.
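As an illustration (the class and field names here are invented, not the actual tool's), the difference looks roughly like this:

    import java.util.Arrays;
    import java.util.IdentityHashMap;
    import java.util.Map;

    // Hypothetical key class whose hashCode() walks all of its content
    // (equals() omitted for brevity; a real value class would override both)
    class Dataset {
        double[] samples; // possibly huge

        @Override public int hashCode() {
            return Arrays.hashCode(samples); // O(content size) on every lookup
        }
    }

    class Cache {
        // IdentityHashMap hashes by reference (System.identityHashCode)
        // and never reads the key's content:
        static final Map<Dataset, String> BY_IDENTITY = new IdentityHashMap<>();
    }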


I'm guessing GP means the complexity guarantee sidesteps the complexity of the hashing function. It probably doesn't matter all that much in the typical case - I'm guessing 80-90% of hash map use is with very short strings.


Well-written complexity guarantees specify the operations they count. Otherwise sorting in O(n log(n)) also "sidesteps" the cost of comparison too.


And the analysis of hashmaps is not such a well-written guarantee -- as you resize, you need a bigger hash function output to reach all possible buckets. A bigger hash function output, assuming you have to keep the avalanche effect to keep output well-scrambled, requires more computations.

Earlier discussion: https://news.ycombinator.com/item?id=9807739
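For a concrete picture of that coupling: with the power-of-two table sizes that java.util.HashMap uses, the bucket index is taken from the low bits of the hash, so every doubling of the table consumes one more bit of hash output:

    // Power-of-two capacity: the index mask widens as the table grows.
    // Capacity 16 uses 4 bits of the hash; capacity 2^20 uses 20 bits.
    static int bucketIndex(int hash, int capacity) {
        return hash & (capacity - 1);
    }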


Short strings, long strings; they're going to use the same key length. Calculating the key may take longer for the long string, if you're basing the hash on the contents of the string[1], but the key won't end up being a different size. The md5 of a 3-byte string is 16 bytes and the md5 of a 40GB string is also 16 bytes.

[1] Not typical. e.g. Java takes the hash key of an object to be its address in memory, which doesn't require looking at the contents.
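For example, the fixed digest size is easy to check with java.security.MessageDigest:

    import java.security.MessageDigest;

    public class DigestSize {
        public static void main(String[] args) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Digesting takes time proportional to the input length,
            // but the output is always 16 bytes:
            System.out.println(md5.digest("abc".getBytes()).length); // 16
        }
    }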


> Calculating the key may take longer for the long string

Right, that’s exactly what they are warning about.

> Not typical. e.g. Java takes the hash key of an object to be its address in memory

No, that’s just the base implementation in Object (and arguably it was a bad idea). All useful “value type” classes will override it with a real hash of the content, including String.

There are some cases in Java where you do want to use IDs instead of values as your map keys, but they’re rare.
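The contrast is easy to demonstrate: two distinct but equal Strings share a content hash, while their identity hashes (what the default Object.hashCode() gives you) differ:

    public class HashDemo {
        public static void main(String[] args) {
            String a = new String("hello");
            String b = new String("hello");
            System.out.println(a.hashCode() == b.hashCode()); // true: content hash
            System.out.println(System.identityHashCode(a)
                    == System.identityHashCode(b)); // almost always false: two objects
        }
    }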


> All useful “value type” classes will override it with a real hash of the content

Well, this is necessary for a lot of sensible things you'd want to do with non-numeric value types as hash keys...

> including String

...except String is something of an intermediate case. There are loads of use cases where what you're really using is a set of constant strings, not variables that contain arbitrary character data. In that case, you should intern the strings, resulting in non-"value type" keywords where the only thing you care about for equality is whether two keywords do or don't have the same machine address.

I don't actually know how Java handles this, but I had the vague idea that two equal String literals will in fact share their machine address. And String is specifically set up to accommodate this; Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.


Java does intern string literals and constants, but you can’t rely on reference equality unless you intern every string you create at runtime by formatting or decoding, and it isn’t specified whether that creates strong references that will never be GC’d.


Yes, Strings are immutable, so they only calculate their hashCode once, then cache it. However, you need to explicitly intern them with String.intern() if you want to avoid multiple copies of the same String.
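A quick demonstration of where interning does and doesn't happen automatically:

    public class InternDemo {
        public static void main(String[] args) {
            String lit1 = "hello";
            String lit2 = "hello";
            // Built at runtime so the compiler can't constant-fold it:
            String built = new StringBuilder("hel").append("lo").toString();

            System.out.println(lit1 == lit2);           // true: literals are interned
            System.out.println(lit1 == built);          // false: fresh object
            System.out.println(lit1 == built.intern()); // true: canonical copy
        }
    }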


> Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.

Hey, and now you have two problems: String hashing and finding all strings which are equal to each other in memory


Well, no, the whole point of this discussion is that solving the second problem means the first problem never comes up.

And this isn't exactly some exotic approach; how often do you think people write Hashes in Ruby where the keys they use are all symbols? It's so common that there's dedicated syntax for it.


It's as old as Lisp, but there's a reason symbols exist separately from strings - they're used differently. Strings are frequently transformed, symbols almost never are. Strings are frequently taken from end-user input, symbols very rarely. Strings are sometimes very large, symbol names are almost universally very short.

The problem is, interning is an expensive operation. It means adding to an ever-growing database of strings, but first checking whether the string is already there. You don't want to do that every time you change case or flip a letter in a string, or use it to access a hash table. I'm not saying it can't be done, but I honestly have no idea how to implement sane, generic, automatic interning of strings. I feel more comfortable having a symbol type, and control over turning strings into symbols.
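For what it's worth, explicit, opt-in interning is only a few lines (the Interner class here is a hypothetical sketch; a production version would want weak references so canonical strings can be collected):

    import java.util.concurrent.ConcurrentHashMap;

    final class Interner {
        private static final ConcurrentHashMap<String, String> CANON =
                new ConcurrentHashMap<>();

        // Returns the canonical copy of s. Every call pays a hash lookup,
        // which is exactly why you don't want this to run implicitly on
        // every string operation.
        static String intern(String s) {
            String prev = CANON.putIfAbsent(s, s);
            return prev != null ? prev : s;
        }
    }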


I definitely agree that uninterned strings are important. All I'm really trying to say down here is that there are many cases where you have a hash table which uses strings as keys (as an implementation detail), when (conceptually) it wants to be using symbols.

(And on a less fundamental level, the particular Java String class is less string-like and more symbol-like than most string types, and this appears to have been done intentionally.)


If the key length is constant, the map has an upper limit on the number of possible elements, so all operations are constant time.


Key length must necessarily be at least log(N) bits to be able to identify N different keys.


This is O(1) where N is constant.


Yes. Everything is O(1) if N is constant, including log(N), N^2, 2^N, N!, etc. That's a tautology.


> Everything is O(1) if N is constant, including log(N), N^2, 2^N, N!, etc.

Not even close. 2^k is not O(1) by virtue of N being constant. Only 2^N.

This has been covered above. It is more common to consider the complexity of hash table operations in terms of the number of operations, or the size of the table; the size of the key is very often constant. These are different variables; the constant size of the key does not trivialize the complexity of inserting N items each with a constant key size.


Here, the relevant key is the output of the hash function though -- that's what you need to increase in order to ensure you can reach all buckets. And that (k) must increase with the table size. So it is not constant and depends on n (table size).

Earlier discussion: https://news.ycombinator.com/item?id=9807739


I remember a proof in CLRS which first developed a function that was bounded above by 5 for all conceivable input ("a very quickly-growing function and its very slowly-growing inverse"), and then substituted the constant 4 or 5 into a complexity calculation in place of that function, giving a result which was "only" correct for all conceivable input.

The same approach applies to key length requirements for hash tables with arbitrarily large backing stores. They do not grow as slowly as the CLRS log* function, but they grow so slowly that there are easily identifiable sharp limits on how large they can be -- an easy example is that a hash table cannot use more memory than the hardware offers no matter how the software is written. A backing store with 1TB of addressable bytes cannot need the key to be more than 40 bits long.

On a different note, by "table size" in my earlier comment I meant to refer to the number of entries in the table, not the capacity of the backing store. It seems like you might be using the same word for a different concept?


>The same approach applies to key length requirements for hash tables with arbitrarily large backing stores. They do not grow as slowly as the CLRS log* function, but they grow so slowly that there are easily identifiable sharp limits on how large they can be -- an easy example is that a hash table cannot use more memory than the hardware offers no matter how the software is written. A backing store with 1TB of addressable bytes cannot need the key to be more than 40 bits long.

So? That's still putting a bound on table size, which makes it in-practice constant, but doesn't make the algorithm O(1), because you can never get such a result by bounding n, for the reasons the GGP gave -- that's cheating.

Your complexity bound has to be written on the assumption that n (number of elements to store in the hashtable) increases without bound. Assuming you will never use more than Y bytes of data is not valid.

>On a different note, by "table size" in my earlier comment I meant to refer to the number of entries in the table, not the capacity of the backing store. It seems like you might be using the same word for a different concept?

No, I was using table size exactly as you, to mean the number of elements stored. Is there a reason my comments only made sense under a different definition? If not, be charitable. (And avoid using obscure terms.)


> No, I was using table size exactly as you, to mean the number of elements stored. Is there a reason my comments only made sense under a different definition? If not, be charitable. (And avoid using obscure terms.)

I interpreted your comment to refer to the size of the backing store, because that is fundamentally what a hash key needs to be able to address.

I didn't mean to say that, if you were using it that way, you were doing anything wrong, only that there appeared to be a mismatch.


>I interpreted your comment to refer to the size of the backing store, because that is fundamentally what a hash key needs to be able to address.

Under the assumption (upthread) of constant resizing as elements are added, the distinction is irrelevant. The more elements you have in the table, the more elements you need to address, and the more possible outputs your hash function needs to have.

And the needed size of the backing store scales with the number of elements you want to store anyway.

>I didn't mean to say that, if you were using it that way, you were doing anything wrong, only that there appeared to be a mismatch.

Why bring up something like that if it doesn't translate into something relevant to the discussion, e.g. to show my point to be in error?


By that logic, any O(log N) factor is simply O(1). That O(N log N) sort algorithm should just be considered O(N) for all practical purposes.


Incidentally, the person replying to you in that thread incorrectly stated that comparison is O(log N) in the number of bits. The most common comparison function, lexicographic comparison, is actually O(1) average case given random inputs of arbitrary length.
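A sketch of why: the comparison bails at the first differing character, and for uniformly random inputs the expected index of the first mismatch is a small constant (a geometric series), independent of the strings' lengths:

    // Expected O(1) character comparisons for random inputs: the loop
    // almost always exits within the first few positions.
    static int lexCompare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int d = a.charAt(i) - b.charAt(i);
            if (d != 0) return d;
        }
        return a.length() - b.length(); // one is a prefix of the other
    }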


But isn't the key length a constant, so we are back to O(1)? OK, in theory you could exhaust all possible keys of a certain length and proceed to longer keys. That would give us what, O(log n)?


His point is: if you use Moby Dick as the key, it's going to take longer to hash than a three-letter string. Hashing isn't O(1) if the key has variable size.



