
You’re correct about algorithms that do “human” things with text, but you need to think of more examples.

That’s how you write hashing algorithms, checksums, and certain trivial parsers.[0]

But most importantly, right or wrong, this code is out there, running today, god knows where, and you do not slow it down from O(n) to O(n^2).
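A minimal sketch of the kind of code presumably at stake (the function and names here are hypothetical): a loop that indexes into a string one position at a time is O(n) only while charAt is O(1). If indexing had to scan from the start of a variable-width encoding, the same loop would quietly become O(n^2) without a single line changing.

    class StringScan {
        // Hypothetical example: a per-character scan that assumes O(1) indexing.
        // If charAt(i) had to walk the string from the beginning (as it would
        // over a variable-width encoding like UTF-8), this loop degrades from
        // O(n) to O(n^2).
        static int countSpaces(String s) {
            int count = 0;
            for (int i = 0; i < s.length(); i++) {
                if (s.charAt(i) == ' ') {
                    count++;
                }
            }
            return count;
        }
    }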



Is such code really going to be ported to WASM though? And does it really matter for the string lengths that a typical web application has to process? WASM really doesn't have to worry about legacy that much.


Hashing algorithms and checksums work on bytes, not characters.


Here is the JDK 7 String#hashCode(), which operates on characters: https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89....

That changed in newer versions, because String is now backed by a `byte[]` rather than a `char[]`, but the char-based version was just fine. A hash algorithm can take in bytes, characters, or ints; it doesn't matter.
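For anyone not clicking through, roughly the recurrence that method uses (paraphrased and written against charAt() rather than the private char[] field, so it stands alone; not the JDK source verbatim):

    class StringHash {
        // The classic 31*h + c recurrence, fed with UTF-16 code units.
        // Nothing about the algorithm cares whether the inputs are bytes,
        // chars, or ints.
        static int hash(String s) {
            int h = 0;
            for (int i = 0; i < s.length(); i++) {
                h = 31 * h + s.charAt(i);
            }
            return h;
        }
    }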

In Java, you don't get access to the bytes that make up a string, to preserve the string's immutability. So for many operations where you might operate on bytes in a lower-level language, you end up using characters (unless you're the standard library and can finagle access to the bytes), or else making a byte copy of the entire string.

I admit, checksums computed over characters sound a bit weird, but they should also be perfectly well-defined.
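As an illustration (a made-up, Fletcher-style sum, not any standard checksum): running it over UTF-16 code units is well-defined, because charAt(i) always yields the same 16-bit values for the same string, even though the JVM never exposes the underlying bytes.

    class CharChecksum {
        // Illustrative only: a Fletcher-style running sum over UTF-16
        // code units instead of bytes. Deterministic and well-defined
        // for any Java String.
        static long checksum(String s) {
            long sum1 = 0;
            long sum2 = 0;
            for (int i = 0; i < s.length(); i++) {
                sum1 = (sum1 + s.charAt(i)) % 65535;
                sum2 = (sum2 + sum1) % 65535;
            }
            return (sum2 << 16) | sum1;
        }
    }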


A possible optimization would be to change the internal representation on the fly for longer strings as soon as random accesses are observed. Experiments would be needed to tell where the right thresholds are. JavaScript engines already do internal conversions between string representations.
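A rough sketch of the idea (the class name and the promotion threshold are invented here; real engines use their own heuristics and representations): keep the compact byte form until enough random accesses are seen, then pay one O(n) conversion to an O(1)-indexable char array.

    import java.nio.charset.StandardCharsets;

    // Illustrative sketch only: a string that starts as UTF-8 bytes and
    // switches to an indexable char form after a made-up number of random
    // accesses.
    class AdaptiveString {
        private static final int PROMOTION_THRESHOLD = 8; // invented threshold

        private byte[] utf8;        // compact representation
        private char[] chars;       // indexable representation, built lazily
        private int randomAccesses; // how many charAt calls we've seen

        AdaptiveString(String s) {
            this.utf8 = s.getBytes(StandardCharsets.UTF_8);
        }

        char charAt(int index) {
            if (chars == null && ++randomAccesses >= PROMOTION_THRESHOLD) {
                // One O(n) conversion, after which every access is O(1).
                chars = new String(utf8, StandardCharsets.UTF_8).toCharArray();
            }
            if (chars != null) {
                return chars[index];
            }
            // Still in the compact form: decode on the fly (O(n) per access).
            return new String(utf8, StandardCharsets.UTF_8).charAt(index);
        }
    }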




