
I'm not advocating you write this, I'm saying people have written it, probably hundreds of thousands of times, and if charAt() becomes O(n) instead of O(1), this code suddenly hangs your CPU for 10 seconds on a long string, thus you can't really swap out UTF-16 for UTF-8 transparently.
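For concreteness, the pattern looks something like this (a made-up Java sketch, not anyone's actual code; countSpaces is a hypothetical name):

    // Hypothetical sketch: an index loop that assumes charAt(i) is O(1).
    // If charAt(i) instead had to scan from the start of a UTF-8 string
    // to find the i-th character, this loop would silently become O(n^2).
    public class ScanDemo {
        static int countSpaces(String s) {
            int count = 0;
            for (int i = 0; i < s.length(); i++) {
                if (s.charAt(i) == ' ') count++;
            }
            return count;
        }

        public static void main(String[] args) {
            System.out.println(countSpaces("one two three")); // prints 2
        }
    }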


Your point doesn't stand for UTF-16 either: it isn't a fixed-length encoding, so the same code is just as broken in UTF-16.

It was always O(n).

That's assuming, of course, you aren't using UTF-32, which has its own set of problems (BE or LE byte order) and sees little usage outside of China.


...it's not O(n). Many languages, JS, Java, and C# included, have O(1) access to a character at a given position. You correctly note that it won't work well with international strings, but GP is right that A LOT of code like this was written by Western, ASCII-brained developers.


Haven't used Java in a while, but I believe charAt() returns a UTF-16 code unit and is constant-time access. So something like the above works not only for ASCII but also for the majority of Western languages and special characters you may encounter on a day-to-day basis.
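Roughly like this, if memory serves (a quick sketch under that assumption, not a definitive claim about the implementation):

    public class CharAtDemo {
        public static void main(String[] args) {
            String s = "café";               // all BMP characters
            // charAt() indexes straight into the UTF-16 code units (O(1)),
            // and for BMP text one code unit is one character:
            System.out.println(s.charAt(3)); // é
        }
    }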


It's constant time only if you ignore surrogate pairs, i.e., everything outside the Basic Multilingual Plane. By that logic UTF-8 is constant time too, as long as you ignore anything that isn't ASCII, because most text is in English.

Saying it works fine if you ignore errors and avoid edge cases is just a clever rephrasing of "it worked on my machine."

Plus, emoji such as U+1F600 GRINNING FACE sit above the BMP, so even in Western-language text you are bound to run into such "exceptions".
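A minimal sketch of what goes wrong (assuming Java, to match the charAt() discussion above):

    public class EmojiDemo {
        public static void main(String[] args) {
            String s = "hi 😀";  // U+1F600 GRINNING FACE, outside the BMP
            // length() counts UTF-16 code units, not characters:
            System.out.println(s.length());                    // 5, not 4
            // charAt(3) hands back a lone high surrogate:
            System.out.printf("U+%04X%n", (int) s.charAt(3));  // U+D83D
            // Iterating by code point sees the emoji as one unit:
            s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
        }
    }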



