In Python, len() on a bytes type gives you the number of bytes, and len() on a str type gives you the number of codepoints. I think that makes sense, as strings are only intended to deal with text, and you should never have to worry about byte indexing at all.
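A quick illustration (any character outside ASCII shows the gap):

    s = "héllo"            # 5 code points; 'é' is U+00E9
    b = s.encode("utf-8")  # 'é' takes two bytes in UTF-8

    print(len(s))  # 5
    print(len(b))  # 6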
As someone who has done both, I'd say that argument is wrong. It is much more convenient to index by code point. Indexing by bytes is almost never what you want, and it leads to a lot of errors.
In many cases it's not very useful, but there are clearly cases where it is, e.g. if you want to normalize text, compose/change emojis, stuff like that.
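For instance, normalization can change the code point count without changing the rendered text at all; a quick sketch using the stdlib unicodedata module:

    import unicodedata

    composed = "é"  # U+00E9, one code point
    decomposed = unicodedata.normalize("NFD", composed)  # 'e' + U+0301 combining accent

    print(len(composed))    # 1
    print(len(decomposed))  # 2
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True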
A codepoint is the "smallest useful addressable unit" when dealing with Unicode text, so it makes sense that's the default.
It's also comparatively expensive to address grapheme clusters: their boundaries depend on Unicode property data and have to be found by scanning, whereas indexing a str by code point is O(1).
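For example, even counting them takes a full scan (a sketch using the third-party regex module, which supports \X for extended grapheme clusters; stdlib re doesn't):

    import regex  # pip install regex

    s = "👍🏿"  # U+1F44D + U+1F3FF: 2 code points, 1 grapheme cluster

    print(len(s))                        # 2 code points
    print(len(regex.findall(r"\X", s)))  # 1 grapheme cluster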
> In many cases it's not very useful, but there are clearly cases where it is, e.g. if you want to normalize text, compose/change emojis, stuff like that.
I can see that iterating through the text by codepoint could be useful for some of those cases, but I still can't see why you'd ever want to index by codepoint?
For the same reason you want to index anything: to slice, insert, remove, etc. E.g. to replace a skin tone modifier in an emoji: "s = s[:i] + chr(0x1f3ff) + s[i+1:]", or to insert one: "s = s[:i] + chr(0x1f3ff) + s[i:]". (Python strings are immutable and you can't concatenate an int to a str, so it has to be chr() plus slicing rather than "str[i] = 0x1f3ff".)
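A runnable version of that, with i and the modifier as example values:

    DARK = chr(0x1f3ff)  # U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6

    s = "👍🏻"  # thumbs up + light skin tone modifier: two code points
    i = 1      # index of the modifier code point

    s = s[:i] + DARK + s[i + 1:]  # replace the modifier
    print(s)                      # 👍🏿

    bare = "👍"
    bare = bare[:1] + DARK + bare[1:]  # insert a modifier
    print(bare)                        # 👍🏿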
But that's a pointlessly inefficient way to do it - surely what you want there is to iterate and transform rather than scan through and then slice? (And don't you need to group by extended grapheme cluster rather than codepoint anyway for that to make sense?)
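Something like this one-pass version, say (SKIN_TONES and darken are just illustrative names):

    SKIN_TONES = {chr(cp) for cp in range(0x1f3fb, 0x1f400)}  # U+1F3FB..U+1F3FF
    DARK = chr(0x1f3ff)

    def darken(s):
        # Single pass: transform each code point while iterating,
        # instead of scanning for an index and then slicing.
        return "".join(DARK if ch in SKIN_TONES else ch for ch in s)

    print(darken("👍🏻 👋🏼"))  # 👍🏿 👋🏿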