How ASCII lost and unicode won

joshuaellinger · on July 9, 2013

ASCII is by far the most successful character encoding that computers have used. It was invented in 1963, back in the era of punch cards and core memory. Modern RAM did not exist until 1975 -- a decade later.

Unicode is the replacement, not the competitor, like 64-bit IP addresses are the replacement for 32-bit IP addresses. It was developed in the early 1990s when RAM got cheap enough that you could afford two bytes per character.

Personally, I deal with data all the time and rarely encounter unicode. Of course, I'm in the US dealing with big files out of financial and marketing databases. In fact, I've seen more EBCDIC than UNICODE.

chongli · on July 9, 2013

>64-bit IP addresses are the replacement for 32-bit IP addresses

IPv6 addresses are 128-bit.

VLM · on July 9, 2013

"Modern RAM did not exist until 1975 -- a decade later."

What does that even mean? It doesn't mean DIP packaged DRAM because my dad was buying COTS Intel 1103's in 1971 or so before I was even born. And the first "I'm gonna store one bit of data in a capacitor" was done over the pond in the .uk during WWII at their code breaking plant.

"like 64-bit IP addresses are the replacement for 32-bit IP addresses."

Um...

joshuaellinger · on July 9, 2013

I just looked it up on Wiki. I remember my dad showing me core memory (after it wasn't in production) in the late 70s.

lotsofcows · on July 9, 2013

It's not even really a replacement. More of an extension. Anything that talks Unicode can talk ASCII (although not vice versa).

qznc · on July 9, 2013

Again the common mistake to confuse codes and encoding.

ASCII is an encoding which can represent a very small subset of Unicode.

UTF8 is an encoding which can represent all of Unicode and it is a superset of ASCII.

UTF32 can represent all of Unicode, but is not a superset of ASCII.

Therefore, something can talk Unicode via UTF32 without being able to talk ASCII.
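A minimal Python sketch of the distinction, using only the standard codecs (the sample string is an arbitrary choice):

```python
text = "Hi"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-le")  # explicit byte order, so no BOM is prepended

# UTF-8 encodes ASCII characters to the very same bytes
print(ascii_bytes == utf8_bytes)   # True
# UTF-32 spends four bytes on every code point, so the byte streams differ
print(utf32_bytes)                 # b'H\x00\x00\x00i\x00\x00\x00'
```

So an ASCII-only consumer can read the UTF-8 output untouched, but not the UTF-32 output.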

brudgers · on July 9, 2013

Joel Spolsky provides the necessary (in Joel's opinion of 2003) minimum background in "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)":

http://www.joelonsoftware.com/articles/Unicode.html

tome · on July 9, 2013

But the collection of Unicode codepoints is an extension of the ASCII codepoints, isn't it? (Regardless of how they are encoded).

VLM · on July 9, 2013

I think it's confusing the definition of "talk". Software written to handle UTF-8 can literally simply have 7-bit ASCII shoved thru it. No translation necessary, no software modification necessary or anything.

On the other hand software speaking UTF-32 isn't going to tolerate ASCII being shoved thru them. The output is going to be 1/4 the length you expected and probably a mass of random asian glyphs. You CAN write a shim that turns each 7 bit ascii char into a 32 bit UTF-32 char and it'll work 1:1 perfectly for all 128 characters in ASCII. But outta the box, no you have to write a shim.

Now if you REALLY want to confuse people, after they 100% understand ASCII, UTF-8, and UTF-32, then feed some UTF-16 into something designed to eat either UTF-32 or ASCII or UTF-8. If you understand what a byte order mark is, and why it's important, and how to use it, then you're well on the way to understanding UTF-16.
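A small Python sketch of why the byte order mark matters, assuming nothing beyond the standard `codecs` module (the one-letter sample is arbitrary):

```python
import codecs

text = "A"

# The generic "utf-16" codec prepends a BOM in the machine's native byte order
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# With an explicit byte order no BOM is written, and the two orders differ
little = text.encode("utf-16-le")   # b'A\x00'
big = text.encode("utf-16-be")      # b'\x00A'

# Decoding with the wrong byte order silently yields a different character
wrong = little.decode("utf-16-be")
print(hex(ord(wrong)))              # 0x4100, a CJK ideograph rather than 'A'
```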

This was a common whine about unicode in the early days that all you've done is trade the agony of multiple extended ascii code pages for the agony of multiple formats to represent unicode... Should have just spec'd UTF-8 and no other encoding and been done with it... or maybe UTF-32.

I will say that a world where only UTF-32 exists would be a world with a lot more text compression of webserver responses and stuff like that. It wouldn't be an automatic end of the world.

ygra · on July 9, 2013

Yes, but that's the distinction between character sets and encodings. A character set is just that, a set of characters. In this sense Unicode is a superset of ASCII, Latin 1, and every other pre-existing character set. Encodings map characters to byte( sequence)?s and they differ a lot in that respect.

Practically, both character sets and encodings provide ample opportunities to mangle input, either by replacing unavailable characters by ? or �, or by mapping to the wrong characters because the wrong encoding was assumed (or by throwing errors because the input made no sense).
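The three failure modes can be reproduced in a few lines of Python; this is only an illustration with an arbitrary sample word, not a survey of every way text gets mangled:

```python
original = "café"

# Wrong encoding assumed: UTF-8 bytes read as Latin-1 give mojibake
mojibake = original.encode("utf-8").decode("latin-1")
print(mojibake)        # cafÃ©

# Unavailable character: ASCII has no é, so it is replaced by ?
lossy = original.encode("ascii", errors="replace")
print(lossy)           # b'caf?'

# Input that makes no sense in the assumed encoding: an error, or U+FFFD
bad = b"caf\xe9"       # Latin-1 bytes, not valid UTF-8
print(bad.decode("utf-8", errors="replace"))   # caf�
```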

tome · on July 9, 2013

I mean when ASCII assigns a number to a character, Unicode assigns the same number to the character. UTF16 and UTF32 may encode those numbers differently from how ASCII encodes them, but they are nonetheless the same numbers.

Unicode is not an extension of EBCDIC or Latin1 because characters under them have different codepoints than they do under Unicode.

[Please correct me if I'm wrong. I'm not an expert but this is how it seems to me]

DougBTX · on July 9, 2013

> [Please correct me if I'm wrong. I'm not an expert but this is how it seems to me]

You're basically right, but the subtle point is that these comments are in response to lotsofcows who said, "Anything that talks Unicode can talk ASCII (although not vice versa)."

The implication of "talk" is that the numbers are being encoded. qznc was trying to clarify that lotsofcows's statement is only true if the Unicode system is using an encoding which is backwards compatible with 7-bit ASCII, such as UTF-8.

tome · on July 9, 2013

Thanks, I see :) My comment wasn't particularly relevant.

VLM · on July 9, 2013

"Unicode assigns the same number to the character."

Extremely close but not quite right for some weird corner cases. There's either a cool feature or hideous bug in the unicode design, depends how you look at it, that lets you compose multiple unicode codes together into one glyph. So you can say, "Gimme the code for A with a bar over it" or "Gimme the code for A, now gimme the code for put a bar over the last glyph". Even worse you can stack compose characters, so you can write "A with circle on top and squiggly underneath" at least five different ways in binary that should be rendered visually identical.

There is the one true normalization technique that most people use to convert what they consider poor unicode grammar into a standard form. Some people don't use it of course. And anytime you give users freeform input who knows what kind of crud they'll feed you. So you can never really assume any unicode string is normalized unless you personally normalized it yourself. Even then theoretically two normalized strings, concatenated, might or might not be normalized anymore (although this is often not much of a problem). And this strikes substring manipulation too.
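A short sketch with Python's standard `unicodedata` module, using Å as the example character (any precomposed letter would do):

```python
import unicodedata

precomposed = "\u00c5"        # Å as one code point
decomposed = "A\u030a"        # A followed by COMBINING RING ABOVE

print(precomposed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

# Two strings that are each in NFC, whose concatenation no longer is
left, right = "A", "\u030a"
joined = left + right
print(unicodedata.normalize("NFC", joined) == joined)            # False
```

The last check is the concatenation case: each half is already normalized, but the joined string composes into a single code point.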

A fun source of buffer overflows is normalization can shrink OR EXPAND a unicode string, in theory. So if you use one of those languages without variable length strings, look out. Or if the language tries to take care of it for you, this leads to weird memory fragmentation. Maybe a DOS/DDOS opportunity?

Which leads to a philosophical argument you must decide in your code: if two bitstreams don't match, but you get matching glyph renderings, from the program's point of view are they a match or not? Depends on what your program is trying to do, I guess. Most languages have a library to handle this. Writing your own unicode handler functions is a "here be dragons" moment.

Composing characters are also a fun source of swear words if you're trying to count the number of characters in a file. Something that renders literally identically will have an identical number of characters but somewhat varying number of bytes.
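To make the counting problem concrete, here is a sketch in Python that measures the same rendered é in both normalization forms. Note that the code point count varies too, not just the byte count:

```python
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")    # é as one code point
decomposed = unicodedata.normalize("NFD", "\u00e9")   # e + combining acute accent

for label, s in (("NFC", composed), ("NFD", decomposed)):
    print(label, len(s), "code points,", len(s.encode("utf-8")), "bytes in UTF-8")
# NFC 1 code points, 2 bytes in UTF-8
# NFD 2 code points, 3 bytes in UTF-8
```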

Other fun philosophical arguments are if you read in an un-normalized string and output a normalized string have you changed the string? Well, both no, and yes. I'm sure there's crypto steganography implications.

Google for "unicode normalization form nfc" and stuff like that.

The first page I found with a good faq was:

http://www.macchiato.com/unicode/nfc-faq

ygra · on July 9, 2013

Someone below noted already that grapheme clusters are, at least for dealing with humans, a much better thing to talk about than bytes or abstract characters (code points). The statement "Unicode assigns the same number to the character." was indeed correct, though. It's just that character is such an overloaded word (meaning either byte, code point or grapheme); but for Unicode at least it is synonymous with code point.

qw · on July 9, 2013

AFAIK all Latin1 characters share their codepoints with Unicode too.
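The Latin-1 numbering can be compared against Unicode directly with Python's built-in codecs; `cp037` is used here as one representative EBCDIC code page for contrast:

```python
# Every Latin-1 byte value decodes to the code point with the same number
latin1_matches = all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
print(latin1_matches)        # True

# EBCDIC (code page 037 here) assigns different numbers: 'A' is 0xC1, not 0x41
print("A".encode("cp037"))   # b'\xc1'
```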

cygx · on July 9, 2013

The ASCII coded character set is a subset of Unicode, which is not true for arbitrary codesets, even if their character repertoire is a subset of the Unicode one.

sixbrx · on July 9, 2013

But character sets are not just sets of characters, they are assignments of abstract glyphs to integers, regardless of the storage representation of the integer.

ygra · on July 9, 2013

Eep, funny how I can make the same mistake twice in ~five years. You are right, of course.

Character sets therefore could be modelled as a partial function c : Character ↛ ℕ, while the encoding is a partial function e : ℕ ↛ seq Byte. Something like that.

ASCII and Latin 1 are therefore both subsets of Unicode, even though only ASCII is a subset of UTF-8 while no comparable Unicode encoding exists for Latin 1 or any other encoding.

shrikant · on July 9, 2013

I know what you mean.

I've worked on automated data submissions for banks in the UK, and the insistence on fixed-width, EBCDIC encoded data files for many regulatory filings (FSA, credit rating agencies) was annoying. On the other hand, it was so easy to automate in a VBA macro that I could have quite a bit of free time.

thomasjames · on July 9, 2013

It indeed was so widespread and lasted so long that the Arab world developed a really clever standard transliteration system. https://en.wikipedia.org/wiki/Arabic_chat_alphabet

timthorn · on July 9, 2013

I really hate to nitpick, but the article implies that ASCII was the first character encoding. In fact, there was a rich history of different encodings before that, with different word sizes and/or incompatible 8 bit encodings. It's quite interesting to look back and see what trade-offs were made and why.

ygra · on July 9, 2013

Well, it's the oldest character set and encoding that's still semi-relevant today. I doubt many people nowadays encounter EBCDIC and the like (and if they do, the article isn't aimed at them, I guess).

VLM · on July 9, 2013

But it misses some relevant semi-dramatic story.

For all practical purposes EBCDIC is the "IBM Standard" as opposed to ASCII's "American Standard". The mood in the 70s/80s outside the business world was "In your face IBM!"

And putting "Information Interchange" in the name itself is another "In Your Face" posturing to the mainframe world. We're the future of data transmission and you'd best get used to it, IBM...

ASCII really was a rebellion in the olden times. One that won.

Another story: before ASCII, teletype codes and such were usually modal, with LTRS/FIGS shifts to switch from 5-bit letters to 5-bit numbers. So there's that dramatic great circular wheel of IT where we've oscillated both before and after ASCII between simple encodings and modal encodings. This was an early whine against unicode, who cares about codepages, just embed it like, or in, what amounts to the MIME media type, and glyph-like Asian languages should just be drawn in gif files anyway. Or so the complaints went at that time.

Another design statement story: Kind of like the Uni- in unicode uniting all the extended code pages into one really huge space.

There are other dramatic stories not in the article. For example the Klingon in Unicode movement. Basically about 15 years ago they tried to get Klingon script into unicode, about a decade ago the unicode people (who?) said no, so the Klingon people squatted in the unicode equivalent of what networking people would call RFC1918 space, and it's simmered since then. Will Klingon actually make it into unicode officially or not, who knows. You can add to the fire by pointing out that Unicode is already stuffed with scripts that no living human culture currently uses, and numerous glyph symbols (think of math stuff like + or %). On the other hand this would inevitably result in Tolkien Elvish script being included in unicode. And does this really matter one way or the other? And this maps perfectly into the wikipedia battle between the deletionist (expletive deleted) and the inclusionist saviors of humanity. Well maybe that last line was a little biased to my opinions...

There's tons of extra fun drama to tell about unicode and cutting off the story at the founding of ASCII misses some of the fun drama. Aside from some fun drama that wasn't mentioned in a post aimed at non-techs who we're told really like drama. You could probably turn the Unicode story into a trashy reality TV show somehow; Vampire Romance Fiction is going to be a harder translation although I'd love to see it.

kps · on July 9, 2013

IBM was actually quite involved in the development of ASCII. See Charles E. MacKenzie, Coded Character Sets: History and Development [1] and Bob Bemer's stuff [2].

[1] http://openlibrary.org/works/OL8019369W/Coded_Character_Sets

[2] http://www.bobbemer.com/

VLM · on July 9, 2013

True. Much as they were deeply involved in PCs. But the mainframe people and the desktop people identified pretty strongly with their respective character sets in the old days.

There is no technological reason EBCDIC couldn't have been the encoding for the whole desktop revolution, other than the dramatic central control vs local control, mainframe vs desktop thing.

There is some truth to the claim that whenever IBM reached a fork in the road they usually found a way to go both ways, at least for many years in the olden days.

mcherm · on July 9, 2013

I encounter it occasionally. (Although I am certainly not the article's target demographic.)

huhtenberg · on July 9, 2013

> What’s a bit?

> What’s a byte?

I'm not sure what the target demographic is.

drdaeman · on July 9, 2013

> What’s a byte?

> A byte is a set of 8 bits.

Nitpicking, that's octet, not byte. I believe the distinction was lost and don't think byte has a proper definition anymore.

fosap · on July 9, 2013

But for historical encodings that kind of matters. Because a byte was not always 8 bits, we have to deal with UTF-7 and had to ensure things were 8-bit clean. And UTF-9.

chiph · on July 9, 2013

Baudot, baby!

Had the advantage that an open carrier (all zeros) mapped to NULL so you didn't waste paper (either tape or roll).

nknighthb · on July 9, 2013

Had? Baudot is alive and well on the HF bands.

http://en.wikipedia.org/wiki/RTTY

chris_wot · on July 9, 2013

Yeah, but does anyone still use the Gauss-Weber Telegraph Alphabet? [1]

1. http://www.randomtechnicalstuff.blogspot.com.au/2009/05/unic...

D9u · on July 9, 2013

In the beginning there was smoke signals

salmonellaeater · on July 9, 2013

The fact that UTF-8 and UTF-16 are often exposed to programmers when dealing with text is a major failure of separation-of-concerns. If you had a stream of data that was gzipped, would it ever make sense to look at the bytes in the data stream before decompressing it? Variable-length text encodings are the same. Application code should only see Unicode code points.

In general it was a mistake to put variable-length encodings into the Unicode standard. A much better design would have been to use UTF-32 for the application-level interface to characters, and use a separate compression standard that is optimized for fixed alphabets when transporting or storing text. This has the advantage that the compression scheme can be dynamically updated to match the letter frequencies in the real-world text, and it logically separates the ideas of encoding and compression so that the compression container is easier to swap out. And, of course, an entire class of bugs would be eliminated from application code.

Edited first paragraph to clarify: Variable-length text encodings are the same.

Someone · on July 9, 2013

I agree that that is the ideal end situation, but Unicode would have been dead on delivery if they had chosen that approach. Memory just was too expensive at the time to make a system that, in most of the computer-using world, wasted 75% in every text string. And no, just-in-time decompression wouldn't have worked either; CPU cycles also were too expensive at the time to do that.

Unicode also would have been too incompatible with existing code that copied 8-bit character strings around. See http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt for some rationale behind UTF-8.

cygx · on July 9, 2013

That makes it sound as if UTF-32 would be a silver bullet - it's not. Application code normally has to deal with user-perceived characters/glyphs/grapheme clusters, which means you'll have to treat UTF-32 as variable-length as well.

If you want to get back the supposed benefits of UTF-32, you'll have to dynamically assign codepoints to grapheme clusters.
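A rough Python illustration of the point. `rough_clusters` is a hypothetical helper that only glues combining marks onto their base character; real grapheme segmentation follows UAX #29 and handles many more cases:

```python
import unicodedata

s = "e\u0301"   # one user-perceived character: e + COMBINING ACUTE ACCENT
print(len(s))                        # 2 code points
print(len(s.encode("utf-32-le")))    # 8 bytes, i.e. two UTF-32 code units

def rough_clusters(text):
    # Crude approximation: attach each combining mark to the preceding cluster
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(len(rough_clusters(s)))        # 1
```

Even with fixed-width code units, one on-screen character here spans two of them.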

est · on July 9, 2013

This looks like how Python3 handles Unicode. 1byte for ASCII, 2 bytes for BMP, 4 bytes for everything else.
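This can be observed on CPython 3.3 and later (PEP 393), though it is an implementation detail rather than a language guarantee, and the one-byte form actually covers the whole Latin-1 range, not just ASCII. A sketch, with `bytes_per_char` as an illustrative helper name:

```python
import sys

def bytes_per_char(ch, n=1000):
    # Growth in object size when one more copy of the character is appended
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

print(bytes_per_char("a"))           # 1 on CPython 3.3+
print(bytes_per_char("\u20ac"))      # 2 (euro sign, inside the BMP)
print(bytes_per_char("\U0001f600"))  # 4 (outside the BMP)
```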

ygra · on July 9, 2013

I'm impressed. Easily readable and understandable, short and, as far as I can tell, no factual inaccuracies or wrong information (unlike many other Unicode introductions and tutorials).

qznc · on July 9, 2013

You want some criticism, because we have too little of that here on HN? I'll bite. ;)

"A byte is a set of 8 bits. Computers typically move data around a byte at a time."

A byte being 8 bits is ok. Historically, a byte might have a different number of bits, but all modern architectures use 8 bits. Since this is an introductory article, this is fine. (more details: http://en.wikipedia.org/wiki/Byte)

Computers do not typically move data in byte chunks. You could say "a byte is the smallest unit of data a CPU can load or store". If you talk about moving data, the question is between what. Probably memory. However, there are caches nowadays, since bandwidth is cheap and latency is expensive. Data is moved in cache line chunks, which means 4-64 byte chunks depending on architecture and cache level. Bigger chunks in upcoming architectures.

TheNewAndy · on July 9, 2013

I wouldn't say "all modern architectures use 8 bits". Plenty of DSPs have > 8-bit bytes (SHARCs are 32-bit, ZSPs are 16, 56K-like DSPs are 24).

A byte is the smallest thing that a pointer can point at. Something like MIPS (from memory) has 8-bit bytes, but a load instruction loads 4-bytes at a time (and although it has the unaligned load functions which can be convinced to load a single byte, they aren't general purpose)

chiph · on July 9, 2013

IIRC, the IBM System 360 is the machine that really set the 8 bits == 1 byte convention in stone.

Before that you had CDC machines with 6-bit bytes (or Nybbles if you owned an Apple ][ with a floppy disk)

mjn · on July 9, 2013

There's some subjectivity in where you consider it standardized, but I would put it later than that. Throughout the period when IBM's System/360 was popular, DEC's PDP series was also popular, which used 36-bit words and a somewhat varying definition of what constituted a byte.

If I had to pick a date, I'd pick 1977 as when it solidified. A few elements: ARPAnet standardized on octet-based protocols in the early 1970s, and was getting widespread by the mid 1970s; three popular microcomputers based on 8-bit architectures were introduced in 1977 (the Apple II, TRS-80, and Commodore PET); and DEC introduced the 32-bit VAX in 1977.

kalleboo · on July 9, 2013

And thanks to good old SMTP originally not being specified as 8-bit, we get freakshows like UTF-7 http://en.wikipedia.org/wiki/UTF-7
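Python ships a `utf-7` codec, so the trick is easy to see for yourself. A minimal sketch with an arbitrary sample phrase:

```python
text = "naïve café"
encoded = text.encode("utf-7")

print(encoded)                                # non-ASCII letters become +...- runs
print(all(byte < 0x80 for byte in encoded))   # True: safe for a 7-bit channel
print(encoded.decode("utf-7") == text)        # True
```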

npongratz · on July 9, 2013

I knew bytes were ambiguously sized, but I had thought a nibble/nybble was (A) always half an octet, and (B) a fun Apple coders magazine [0].

But sure enough, I was very wrong on (A): Apple II disk writes were done in "sets of 5-bit or, later, 6-bit nibbles" [1].

Thank you, chiph!

[0] http://www.nibblemagazine.com/ [1] https://en.wikipedia.org/wiki/Nibble#History

chiph · on July 9, 2013

There's probably some variation in the terminology, but I always used these terms:

    bit
    nibble (4 bits)
    nybble (6 bits)
    byte (8 bits)
    word (16 bits)
    :

gnosis · on July 9, 2013

I liked the explanation at the beginning of the Strings chapter in the Dive Into Python 3 tutorial:

http://getpython3.com/diveintopython3/strings.html

aidos · on July 9, 2013

I think this is a great explanation for anyone who's approaching the subject for the first time. It gives a good introduction as to why you're staring at broken data coming from a db and just how royally screwed your afternoon is going to be getting it back into shape :)

rjh29 · on July 9, 2013

What do you think about http://richardharr.is/unicode-in-five-minutes.html ?

Digit-Al · on July 9, 2013

>ASCII really should have been named ASCIIWOA: the American Standard Code for Information Exchange With Other Americans.

So he thinks Americans are the only people to use the English language does he?

kps · on July 9, 2013

It's worse than that, actually, as ASCII from the start¹ included provisions for variants for non-English latin characters and alternate currency symbols, and ASCII was essentially the same project as ECMA-6² (ECMA being the European Computer Manufacturers' Association³, a standardization group founded in 1961).

ASCII as we know it (which is essentially the 1967 version⁴) like the corresponding ECMA standard⁵ provided for overloading punctuation characters as diacritics ("/¨ ^/ˆ ~/˜ '/´ ‘/` ,/¸) to be overstruck in typewriter fashion; ECMA-35⁶ (1971⁷) defines further extension techniques using control and/or escape sequences.

So, yes, it's just a failed attempt at an anti-American cheap shot from someone who isn't familiar with the development of character set encodings.

¹ American Standard Code for Information Interchange, http://www.wps.com/projects/codes/X3.4-1963/index.html

² 7-bit Coded Character Set, http://www.ecma-international.org/publications/standards/Ecm...

³ http://www.ecma-international.org/default.htm

⁴ http://www.wps.com/J/codes/Revised-ASCII/index.html

⁵ 7-bit Input/Output Coded Character Set, 4th Edition is unfortunately the oldest available online; http://www.ecma-international.org/publications/files/ECMA-ST...

⁶ Character Code Structure and Extension Techniques, http://www.ecma-international.org/publications/standards/Ecm...

⁷ Extension of the 7-bit Coded Character Set, http://www.ecma-international.org/publications/files/ECMA-ST...

Trufa · on July 9, 2013

I'm guessing it depends on the language. In spanish we have only one type of accent and only in vowels: áéíóú, and the ñ.

The ñ is right next to the "L" key, and the tilde (accent) is either right next to the ñ or on top of it.

This adds some sort of complexity, but in my experience, the average user simply expects the key to be "where it always was" and I've had to "fix the problem" (changing the input method) many many times.

To answer your question I think that the amount of users that actually take note of what happened and understand it enough to fix it again is minimal, the rest just expect it to work and ask for help when it doesn't.

There are also a lot of variations of spanish keyboards, so it just makes the matter more complicated... I use *unix, Windows and OSx almost interchangeably and know how to change the input language in most of them to ISO spanish spanish quite quickly, but I'm not representative in that regard.

Slightly offtopic but the spanish spanish keyboard layout is extremely comfortable for programming...

pfortuny · on July 9, 2013

No, he says that because there is no Pound, Shilling or other specific symbols for English-speaking countries, I guess.

peterkelly · on July 9, 2013

Another good article on this topic is the one by Joel Spolsky:

http://www.joelonsoftware.com/articles/Unicode.html

gnosis · on July 9, 2013

"Designed as a single, global replacement for localised character sets, the Unicode standard is beautiful in its simplicity. In essence: collect all the characters in all the scripts known to humanity and number them in one single, canonical list. If new characters are invented or discovered, no problem, just add them to the list. The list isn’t an 8-bit list, or a 16-bit list, it’s just a list, with no limit on its length."

Is this really true? My impression was that UTF-32 is a fixed-length encoding which uses 32 bits to encode all of Unicode. It seems that this means that Unicode can never have more code points than could fit in 32 bits. Right?

ygra · on July 9, 2013

Unicode can never have more code points than could fit into 21 bits. Because Unicode is a 21-bit code. This has historical reasons because Unicode was a 16-bit code initially and it soon became apparent that 65536 characters are not enough, especially with the commitment of

a) compatibility to every pre-existing character set and

b) including historical scripts too¹

21 bits was what emerged from expanding UCS-2 to UTF-16 via surrogate pairs, and Unicode was reörganised into 17 “planes”, the first of which, the BMP, contained all code points allocated so far. UTF-32 then was just a simple encoding scheme that allows one code point per code unit and is also efficient to process. 21- or 24-bit code units would be unwieldy on most architectures (especially regarding unaligned memory access).

___________

¹ Arguably the decision to include Emoji made a bigger dent in the code point space than hieroglyphs, Linear [AB], etc., though, but that came a little later.

cygx · on July 9, 2013

Unicode code points go up to 0x10FFFF, which fits in 21 bits.

If this were ever to become a problem (which I don't see happening any time soon), the transition from UCS-2 to UTF-16 is prior art on how to pull off the extension of a coding space.
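The arithmetic behind those numbers, sketched in Python. `to_surrogates` is an illustrative helper name, and U+1F600 is just a convenient code point outside the BMP:

```python
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT.bit_length())          # 21
print((MAX_CODE_POINT + 1) // 0x10000)      # 17 planes of 65536 code points each

def to_surrogates(cp):
    # Split a code point above the BMP into a UTF-16 high/low surrogate pair
    assert 0x10000 <= cp <= MAX_CODE_POINT
    offset = cp - 0x10000                   # at most 20 bits remain
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))                  # 0xd83d 0xde00
```

Two 10-bit halves give 2^20 supplementary code points on top of the 2^16 in the BMP, which is where the 0x10FFFF ceiling comes from.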

Somewhat unrelated, but nevertheless worth mentioning as it's a common misconception: While UTF-32 is a fixed-length coding for Unicode characters, often, the more interesting unit is the grapheme cluster, effectively making UTF-32 into a variable-length coding.

peteri · on July 9, 2013

Think you're right. The article could do with a rewrite, as Unicode used to be 16 bit until 1996 (according to Wikipedia), which explains why Java/Windows are really UTF-16 based.

The more interesting question is if you're designing a new operating system would you pick UTF-8 or UTF-32 as the basis of your character system. Bearing in mind you need to normalise strings anyway for comparison purposes the general space efficiencies for UTF-8 for most systems seem tempting.

lucian1900 · on July 9, 2013

Why would one not use UTF-8? You can't do byte comparisons or indexing, so all processing has to be linear anyway.

alexjeffrey · on July 9, 2013

I believe it depends on the encoding - UTF-32 has a limit of 2^32 characters, but UTF-8 could potentially expand infinitely (ignoring the current arbitrary 4-byte limit).

cygx · on July 9, 2013

That's incorrect - the UTF-8 scheme encodes the sequence length in the first byte as the number of bits set to 1 until the first 0.

The original limit was 6 bytes encoding 31 bits, but this was lowered to 4 bytes encoding 21 bits. If I'm not mistaken, 7 bytes encoding 36 bits (or 8 bytes encoding 42 bits if you make the 0 optional) should be the hard limit.
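Those figures can be checked with a little arithmetic. This is a sketch in Python; `sequence_length` and `payload_bits` are illustrative helper names, not part of any library:

```python
def sequence_length(lead):
    # Count the 1 bits before the first 0 in the lead byte
    ones = 0
    for bit in range(7, -1, -1):
        if not lead & (1 << bit):
            break
        ones += 1
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a lead byte")
    return ones or 1   # 0xxxxxxx is a one-byte sequence

def payload_bits(n):
    # An n-byte lead keeps 7 - n data bits (none once n >= 7);
    # each of the n - 1 continuation bytes carries 6 more
    if n == 1:
        return 7
    return max(0, 7 - n) + 6 * (n - 1)

print(sequence_length("é".encode("utf-8")[0]))   # 2
print([payload_bits(n) for n in (4, 6, 7, 8)])   # [21, 31, 36, 42]
```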

drdaeman · on July 9, 2013

If we ignore this artificial restriction, we can use arbitrary lengths, turning UTF-8 into varint encoding.

I.e.:

      Oct.1    Oct.2    Oct.3
    11111111 110DDDDD DDDDDDDD (7 octets more)
    \_________/
      10 bits

   ("D"s represent data bits)

This is what'd be necessary to use UTF-8 if we needed to go beyond the current upper bound of 0x10FFFF.

cygx · on July 9, 2013

In your scheme, we'll end up with multiple leading bytes, though.

Also, how do you plan to handle 9 leading bits instead of the 10 from your example? That would make the second byte start with 10, the marker for continuation bytes.

drdaeman · on July 9, 2013

Ouch, you're right. Refreshed my memory of UTF-8 encoding and now it seems it's designed so it can't be made to express arbitrary values. Not sure why I imagined the opposite. My mistake.

adsr · on July 9, 2013

The current limit (0x10FFFF) is the artificial restriction on UTF-8 to match UTF-16, not the upper bound. UTF-8 supports 2^31 code points with its current encoding. So nothing needs to change until Unicode consists of 2.1 billion code points or so.

In my mind that will only happen as a result of extreme carelessness.

okwa · on July 9, 2013

> These mappings of numbers to characters are just a convention that someone decided on when ASCII was developed in the 1960s. There’s nothing fundamental that dictates that a capital A has to be character number 65, that’s just the number they chose back in the day.

I don't think it's mere coincidence that the capital letters start at 65 and the lower case at 97 and the decimal digits at 48.
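The bit patterns make the design visible. A quick Python check:

```python
print(bin(ord("0")), bin(ord("A")), bin(ord("a")))
# 0b110000 0b1000001 0b1100001

# Upper and lower case differ only in one bit (0x20)
print(chr(ord("a") ^ 0x20))               # A
# The low four bits of a digit character are its numeric value
print(ord("7") & 0x0F)                    # 7
# The low five bits of a letter give its position in the alphabet
print(ord("A") & 0x1F, ord("z") & 0x1F)   # 1 26
```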

stuartcw · on July 9, 2013

It's not a matter of winning or losing. The pre-unicode mix of character sets was a mess when it came to internationalization. Try truncating a Japanese Shift-JIS string in C. That will learn you..

ygra · on July 9, 2013

Arguably Unicode (UTF-8 and -16) doesn't necessarily make this any easier. Or any variable-length encoding, really. You see halved code points quite frequently, and if not that, then halved " and the like.
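A Python sketch of the halved code point problem and one possible repair. `truncate_utf8` is a hypothetical helper, and it only protects code points, not grapheme clusters:

```python
data = "日本語".encode("utf-8")   # 3 bytes per character here, 9 in total
cut = data[:4]                    # naive byte truncation splits the second character

try:
    cut.decode("utf-8")
except UnicodeDecodeError as exc:
    print("broken:", exc.reason)

def truncate_utf8(raw, limit):
    # Back up over continuation bytes (10xxxxxx) until the cut lands on a boundary
    end = min(limit, len(raw))
    while 0 < end < len(raw) and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end]

print(truncate_utf8(data, 4).decode("utf-8"))   # 日
```

UTF-8 at least makes the boundary detectable from the bytes themselves, because continuation bytes have their own bit pattern.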

danso · on July 9, 2013

OT and out of curiosity...how do non-native English speakers experience typing/keyboard education? I can barely remember how to make any of the basic accents over the `e` when trying to sound French...are typing classes in non-English schooling systems much more sophisticated than in English (i.e. ASCII-centric) schools? I wonder if non-native English typists come away with a better handling of the power of keyboard shortcuts (whether to create accents or not)

kalleboo · on July 9, 2013

Most countries just have different keyboards so they don't need any dead keys. Look to the right of P/L http://www.danielschlaug.com/journal/wp-content/uploads/2011...

yebyen · on July 9, 2013

What do you do without square and curly braces, I'm curious? Everyone writes only lisp and html? :)

lucb1e · on July 9, 2013

In the Netherlands we use the "US International" keyboard, which is basically qwerty like you're used to but with dead keys to make áéàüetc. We don't have that many odd characters so it's okay not to have special keys for it, but I'd agree with anyone saying that even these few are too many. If we're gonna clean the language up though, let's also get rid of other overhead. For example the sentence "If we clean language though, rid other overhead" makes quite a lot of sense.

Other countries that make even less sense and use way too many special characters (from my blunt perspective) usually have different keyboards. Prime example that I know of is France where you have to use the shift key to make numbers.

To answer what you actually asked: Yes we are taught how to make those special characters. For me it's as logical as typing parentheses or the euro sign, but I'm spending most of my waking hours typing one thing or another. Many people don't or barely know how to.

It's not what most of the time is spent on in those courses. We're being taught the asdfjkl; row just like everyone else. Or aoeuhtns, depending on your keyboard (in the Netherlands we only have qwerty though). Making accents is more of a side thing that's mentioned once or twice after learning everything else at an acceptable speed.

eric_the_read · on July 9, 2013

Possible interpretations of "If we clean language though, rid other overhead":

* Those guys over there should|etc. get rid of other overhead.

* I can't type "ride". What does "...ride other overhead" mean? I dunno, but people write incoherent things all the time.

IOW, what you think of as "overhead" probably isn't, especially with natural language.

mattengi · on July 10, 2013

In Korea, almost every keyboard is just a plain QWERTY keyboard with Hangul additionally printed on the keys.[0]

The right Alt key is used as the Hangul (Korean)/English toggle, and the right Ctrl key as the Hanja key.

When toggled to Hangul, only the English letters are overridden by Hangul characters. All numbers and symbols stay the same as in English typing mode.

Basically there are no additional keys compared to QWERTY.

Is it complex and hard to learn to type in Hangul? Nope. Maybe 'Korean' is complex to learn, but 'Hangul' (I mean the script, or the character composition system, something like that) is quite simple.[2]

Actually, it allows a more efficient input layout than English, especially in more restricted environments, like the basic cell phone key layout (E.161).[1]

There was a King, and he was a really great hacker. Because he was a King, he grabbed a bunch of smart guys from all around the country ;) and pushed them to work hard. (Did I say he was a King?) As a result, many good things were invented for the people of the country. Today Korean has its own quite good and expressive characters, and he earned quite a good place.[3]

[0] http://i.imgur.com/j0Xk6oY.jpg

[1] http://bit.ly/11CF0mS

[2] http://blog.naver.com/PostView.nhn?blogId=neraijel&logNo=110...

[3] http://i.imgur.com/69jkSXa.jpg

salmonellaeater · on July 9, 2013

Chinese has many different input methods. For example, in Taiwan most keyboards have the symbols for 4 different input methods printed on them: English/Pinyin on the top left, Zhuyin on the top right, Cangjie on the lower left, and Dayi on the lower right. The top two are phonetic, and the bottom two are symbolic.

Chinese input methods typically require a sequence of key-presses and then a selection from a menu of matching characters, with the most common matches first. Multi-character sequences can be entered without making a choice until the end, in which case the most likely n-grams come first.

Most people in Taiwan learn phonetic (Zhuyin) in school, which is very easy, as long as you know standard pronunciation.

http://en.wikipedia.org/wiki/Keyboard_layout#Taiwan

gilgoomesh · on July 9, 2013

Japanese often type in "Romaji" (a romanized transliteration of Japanese). Basically, they type phonetically in roman characters and the computer suggests/autocompletes characters which match the phonetics. It's a little like typing SMSs in T9 on old numberpad phones.

aroberge · on July 9, 2013

I switch (with a key combination) between a French keyboard, when I need to write text in French with accents, and an English keyboard when I need to type various types of brackets e.g. [] and {}. I actually know two different French keyboard layouts ... and can touch type fairly efficiently using any of the three. I don't consider myself a very efficient user of keyboard shortcuts.

estel · on July 9, 2013

The easiest solution is to use country specific keyboards that include characters most needed by that language ie. http://en.wikipedia.org/wiki/File:KB_France.svg

lmm · on July 9, 2013

Given the controversy over Han unification, I suspect that incompatible charactersets will be with us for a while yet, more's the pity.

lelf · on July 9, 2013

Well, when 99% think unicode = encoding = ucs2 = utf-16, don't believe there's something outside BMP, and wtf is the only word coming to their mind when they hear about graphemes… Unicode won?

rayiner · on July 9, 2013

Unicode, meh. Nobody will ever need more than 128 characters.