
Yeah, that's what it sounded like.

Aside from not supporting incremental updates, that approach has a high memory overhead. For every word of length N, you've got to store 1 + 2 + ... + (N-1) + N characters, which is O(N^2). The constant is 1/2, and you do get some savings from shared prefixes, so that's definitely a worst-case upper bound -- but a trie grows as O(N).
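To make the arithmetic concrete, here's a quick Python sketch (mine, purely illustrative) counting the characters stored per word under each scheme:

```python
# Rough sketch: characters a sorted-set-of-prefixes approach must store
# for one word, vs. characters a trie adds for the same word.

def prefix_set_chars(word):
    # Storing every prefix "h", "he", ..., "hello" costs
    # 1 + 2 + ... + N = N(N+1)/2 characters.
    return sum(len(word[:i]) for i in range(1, len(word) + 1))

def trie_chars(word, existing_prefixes=()):
    # A trie only adds one node (one character) per prefix not already
    # present, so growth is O(N) per word at worst.
    return sum(1 for i in range(1, len(word) + 1)
               if word[:i] not in existing_prefixes)

print(prefix_set_chars("hello"))  # 15 == 5 * 6 / 2
print(trie_chars("hello"))        # 5
```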

The O(log(N)) lookup time can also be a limiting factor with large sets. A trie is going to give you worst-case linear lookup time in the length of the longest string in your set. This can be a pretty dramatic difference as string sets get large.
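For illustration, a minimal trie sketch (my own toy code, nothing from the thread) showing why lookup cost depends only on the key's length, not the set size:

```python
# Minimal trie: membership lookup touches one node per character,
# so it's O(L) in the key's length, independent of how many strings
# are stored.

class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.terminal = False # True if a stored word ends here

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def contains(root, word):
    node = root
    for ch in word:           # one step per character: O(len(word))
        node = node.children.get(ch)
        if node is None:
            return False
    return node.terminal

root = TrieNode()
for w in ("cat", "car", "card"):
    insert(root, w)
print(contains(root, "car"))  # True
print(contains(root, "ca"))   # False: a prefix only, not a stored word
```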



Isn't that O(Nk), where k is the average word length and N is the number of words. Given an upper bound on k, this is no worse than a trie. Of course, if I've misunderstood that part, then ignore what follows.

In a real world scenario (English language), since words share so much structure, this only grows as O(N). [1] Although the top 1000 recently searched items might not share as much structure, that is a trivial amount of data, even for O(N^2). The storage overhead on a trie is high, especially as many bigrams don't occur in English, so array-based tries necessarily waste space.

You're right that the lookup time is killer though. A ternary search tree gives you a good trade-off – really it's just a good way to store a trie with less memory overhead.
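Here's a toy ternary-search-tree sketch (my own illustration; the node layout follows the standard lo/eq/hi shape):

```python
# Ternary search tree: each node holds one character plus three links,
# so the structure is much denser than an array-based trie node.

class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.terminal = False

def tst_insert(node, word, i=0):
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.terminal = True
    return node

def tst_contains(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_contains(node.lo, word, i)
    if ch > node.ch:
        return tst_contains(node.hi, word, i)
    if i + 1 < len(word):
        return tst_contains(node.eq, word, i + 1)
    return node.terminal

root = None
for w in ("cat", "cap", "dog"):
    root = tst_insert(root, w)
print(tst_contains(root, "cap"))  # True
print(tst_contains(root, "ca"))   # False
```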

1. Running the following commands

    # wc -l /usr/share/dict/words
    234936
    # sed -n '{: foo
        p
        s/^\(.*\).$/\1/
        t foo
        p
    }' /usr/share/dict/words | sort | uniq | wc -l
    791089
shows only a factor-of-3.3 increase when expanding a large dictionary of English to all its prefixes. For propernames, 1323 words turn into 3612 prefixes, a factor of 2.7.
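The same count can be reproduced without sed; a self-contained Python sketch, using a tiny hard-coded word list in place of /usr/share/dict/words:

```python
# Expand a word list to the set of all its non-empty prefixes and
# report the blow-up factor (tiny sample list, for illustration only).

words = ["cat", "cats", "catalog", "dog", "dogma"]

prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}

print(len(words), len(prefixes), len(prefixes) / len(words))
```

On a shared-prefix-heavy list like this, the factor stays small, which is the point being made above.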


"Isn't that O(Nk), where k is the average word length and N is the number of words. Given an upper bound on k, this is no worse than a trie."

Let's make an even stronger assumption that every word in your set has length k. You then need to store (k^2 + k)/2 characters per distinct word with the sorted set implementation (plus any additional data-structure overhead). In a worst-case scenario, you have N of these k-length words, distinct at the first character (this might seem paranoid for something like ASCII, but think about Unicode here...or just change the problem to N distinct strings at character >=2), resulting in a worst-case memory bound of O(N * ((k^2 + k) / 2)).
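Plugging in illustrative numbers (k = 10 and N = 1,000,000 are assumptions for the sake of the example, not figures from the thread):

```python
# Worst case: N words of length k sharing no prefixes.
# Sorted-set-of-prefixes cost per word is 1 + 2 + ... + k = (k^2 + k)/2.

k = 10
chars_per_word = (k * k + k) // 2
print(chars_per_word)                # 55 characters per 10-char word

N = 1_000_000
print(chars_per_word * N)            # 55,000,000 characters stored
print(k * N)                         # vs. ~10,000,000 in a prefix-free trie
```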

That said, the possible range of N will be much larger than k for any non-trivial k, so I suppose it's fair to say that real-world memory use is something more like O(N) with a very high constant that's quadratically related to your average word length. But that's still substantially worse than a well-implemented trie (where your constant is small and unrelated to k), and with large data sets every byte counts.

"Although the top 1000 recently searched items might not share as much structure, that is a trivial amount of data, even for O(N^2)....the storage overhead on a trie is high, especially as many bigrams don't occur in English, so array-based tries necessarily waste space."

If your set size is only 1,000, then yeah, this conversation is academic. I'm talking about big suggestion sets -- for a website, it's not too difficult to get into a situation where your suggestion set is in the hundreds of thousands or millions of strings or more. Remember that web text is nothing like /usr/share/dict. Proper nouns, acronyms, non-words, new words, foreign words, etc. all conspire to make real-world language much larger and more diverse than the system dictionary. The size of real-world vocabularies is one of those counter-intuitive things that I wouldn't have guessed before working on websites.

Finally, while you definitely have to be careful when implementing a trie, there are improved trie algorithms (like burst tries) that greatly improve upon average- and worst-case memory usage. A totally naïve trie implementation has a ton of overhead...but then, a naïve set implementation has a ton of overhead, too. So it's important not to be naïve. ;-)

(related: looks like the redis guys have been oddly resistant to adding tries, despite several requests: https://code.google.com/p/redis/issues/detail?id=110)


(These are all excellent points and it's been a pleasure to discuss them with you.)

Yes, it should have been O(N * k^2). I just didn't think it was as bad as O(N^2). (Of course if you switch to Unicode, your trie gets even more exciting ;)

I guess suggestions on search queries are more complex since you're now talking about sentences. I'm not convinced that web text is nothing like the system dictionary though – that dictionary has 300 thousand words, which dwarfs the lexicon of even the most advanced user of English. Also, proper nouns and new words still follow the phonetic structure of your base language, so will compress well, but you're right about acronyms and non-words. Do you have any references on the size of real-world vocabularies? Obviously personal experience with implementing such systems counts for a great deal.

Finally, I agree that a non-naive trie is the best data structure for this (in fact, they seem to be the best data structure full-stop – some tries beat hash tables on lookup by index) but if you are building on the primitives you've got, sorted sets don't seem too bad.
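As a sketch of that last point, prefix completion over a plain sorted sequence needs only binary search; a toy Python version using bisect (names and sample data are mine):

```python
# Prefix completion with nothing but a sorted sequence, the kind of
# primitive a sorted-set store gives you. bisect locates the range of
# entries sharing the prefix in O(log N).

import bisect

entries = sorted(["cap", "car", "card", "care", "cat", "dog"])

def complete(prefix):
    lo = bisect.bisect_left(entries, prefix)
    # "\uffff" sorts after any realistic extension of the prefix.
    hi = bisect.bisect_right(entries, prefix + "\uffff")
    return entries[lo:hi]

print(complete("car"))  # ['car', 'card', 'care']
```

Each lookup is two binary searches plus a slice, which is why the sorted-set approach is workable even though a trie would be asymptotically tidier.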



