
The Japanese example is interesting, because wc really rather depends on the language. So does regex. And quite a lot of other things that are useful in a Latin-derived world get harder in a right-to-left written language such as Arabic.

I think if anything will force us to rethink the underlying assumptions of Unix, it's Unicode.



Please note that:

   $ echo "wc can't count æøæ either" |wc
      1       5      29
   $ echo "wc can't count aaa either" |wc
      1       5      26
[edit: Also, note that Japanese is written both left-to-right and top-to-bottom, with columns running right-to-left]
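The byte counts above differ because æ and ø each encode as two bytes in UTF-8, and wc's third column is a byte count, not a character count. A minimal check (assuming a UTF-8 terminal and script encoding):

```shell
# æ and ø are each a two-byte UTF-8 sequence, so three such
# characters occupy 6 bytes; wc -c counts raw bytes.
printf 'æøæ' | wc -c   # 6
printf 'aaa' | wc -c   # 3
```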


Is there a way to divide the Unicode world into ranges, with some clearly marked as "will work with this approach" and others marked differently?

A sort of code-pages approach - but we all work on the same Unicode foundation; just, when it comes to Japanese, a non-speaker like me would gracefully down-scale all the operations to "print and then suggest we hire some people to write an extension".

It's, I guess, linking LOCALE to a number range...


wc counts bytes; to make it count characters, use -m in the GNU version.
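A quick sketch of the difference (note that -m's behavior depends on the current locale, and the C.UTF-8 locale may not be installed everywhere):

```shell
printf 'æøæ' | wc -c                  # 6: raw bytes, locale-independent
printf 'æøæ' | LC_ALL=C wc -m         # 6: in the C locale, each byte is one "character"
printf 'æøæ' | LC_ALL=C.UTF-8 wc -m   # 3, if a UTF-8 locale is available
```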


I think the point being made is that -m does not count characters; it counts multibyte characters according to the current locale. Or at least tries to. The same Unicode code point in UTF-8 and UTF-16 (and UTF-32) could be very different strings of bytes. No way to tell unless you know beforehand whether you are dealing with UTF-8 or UTF-16. Hence the BOM, but no one likes that.

It's hard. And possibly we have to abandon tools like wc when we leave the Latin world.
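The encoding-dependence is easy to demonstrate with iconv: the same three characters occupy a different number of bytes in each encoding, and asking for plain "UTF-16" typically prepends a BOM (BOM behavior shown here is glibc iconv's; other implementations may differ):

```shell
printf 'abc' | wc -c                               # 3 bytes in UTF-8/ASCII
printf 'abc' | iconv -f UTF-8 -t UTF-16LE | wc -c  # 6 bytes in UTF-16LE
printf 'abc' | iconv -f UTF-8 -t UTF-16 | wc -c    # typically 8: 2-byte BOM + 6
printf 'abc' | iconv -f UTF-8 -t UTF-32LE | wc -c  # 12 bytes in UTF-32LE
```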



