(These are all excellent points and it's been a pleasure to discus them with you.)
Yes, it should have been O(N * k^2). I just didn't think it was as bad as O(N^2). (Of course if you switch to Unicode, your trie gets even more exciting ;)
I guess suggestions on search queries is more complex since you're now talking about sentences. I'm not convinced that web text is nothing like the system dictionary though – that dictionary has 300 thousand words, which dwarfs even the most advanced user of the English language's lexicon. Also, proper nouns and new words still follow the phonetic structure of your base language, so will compress well, but you're right on acronyms and non-words. Do you have any references on the size of real-world vocabularies? Obviously personal experience with implementing such systems counts for a great deal.
Finally, I agree that a non-naive trie is the best data structure for this (in fact, they seem to be the best data structure full-stop – some tries beat hash tables on lookup by index) but if you are building on the primitives you've got, sorted sets don't seem too bad.
Yes, it should have been O(N * k^2). I just didn't think it was as bad as O(N^2). (Of course if you switch to Unicode, your trie gets even more exciting ;)
I guess suggestions on search queries is more complex since you're now talking about sentences. I'm not convinced that web text is nothing like the system dictionary though – that dictionary has 300 thousand words, which dwarfs even the most advanced user of the English language's lexicon. Also, proper nouns and new words still follow the phonetic structure of your base language, so will compress well, but you're right on acronyms and non-words. Do you have any references on the size of real-world vocabularies? Obviously personal experience with implementing such systems counts for a great deal.
Finally, I agree that a non-naive trie is the best data structure for this (in fact, they seem to be the best data structure full-stop – some tries beat hash tables on lookup by index) but if you are building on the primitives you've got, sorted sets don't seem too bad.