
The Japanese example is interesting, because wc really rather depends on the language. So does regex. And quite a lot of other things that are useful in a Latin-derived world get harder in a right-to-left written language such as Arabic.

I think if anything will force us to rethink the underlying assumptions of Unix, it's Unicode.



Please note that:

   $ echo "wc can't count æøæ either" |wc
      1       5      29
   $ echo "wc can't count aaa either" |wc
      1       5      26
[edit: Also, note that Japanese is written both left-to-right and top-to-bottom, with columns running right-to-left]
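The byte counts above differ because æ and ø each encode as two bytes in UTF-8, and wc's third column is a byte count, not a character count. A minimal check (assuming a UTF-8 terminal and script encoding):

```shell
# æ and ø are each a two-byte UTF-8 sequence, so three such
# characters occupy 6 bytes; wc -c counts raw bytes.
printf 'æøæ' | wc -c   # 6
printf 'aaa' | wc -c   # 3
```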


Is there a way to divide the Unicode world into ranges, with some clearly marked as "will work with this approach" and others marked differently?

A sort of code-pages approach - but we all work on the same Unicode foundation; just, when it comes to Japanese, a non-speaker like me would gracefully down-scale all the operations to "print and then suggest we hire some people to write an extension".

It's, I guess, linking LOCALE to a number range...


wc counts bytes; to make it count characters, use -m in the GNU version.
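A quick sketch of the difference (note that -m's behavior depends on the current locale, and the C.UTF-8 locale may not be installed everywhere):

```shell
printf 'æøæ' | wc -c                  # 6: raw bytes, locale-independent
printf 'æøæ' | LC_ALL=C wc -m         # 6: in the C locale, each byte is one "character"
printf 'æøæ' | LC_ALL=C.UTF-8 wc -m   # 3, if a UTF-8 locale is available
```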


I think the point being made is that -m does not count characters; it counts multibyte characters according to the current locale. Or at least tries to. The same Unicode code point in UTF-8 and UTF-16 (and UTF-32) could be very different strings of bytes. No way to tell unless you know beforehand whether you are dealing with UTF-8 or UTF-16. Hence the BOM, but no one likes that.

It's hard. And possibly we have to abandon tools like wc when we leave the Latin world.
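The encoding-dependence is easy to demonstrate with iconv: the same three characters occupy a different number of bytes in each encoding, and asking for plain "UTF-16" typically prepends a BOM (BOM behavior shown here is glibc iconv's; other implementations may differ):

```shell
printf 'abc' | wc -c                               # 3 bytes in UTF-8/ASCII
printf 'abc' | iconv -f UTF-8 -t UTF-16LE | wc -c  # 6 bytes in UTF-16LE
printf 'abc' | iconv -f UTF-8 -t UTF-16 | wc -c    # typically 8: 2-byte BOM + 6
printf 'abc' | iconv -f UTF-8 -t UTF-32LE | wc -c  # 12 bytes in UTF-32LE
```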



