ASCII is by far the most successful character encoding that computers have used. It was invented in 1963, back in the era of punch cards and core memory. Modern RAM did not exist until 1975 -- a decade later.
Unicode is the replacement, not the competitor, like 64-bit IP addresses are the replacement for 32-bit IP addresses. It was developed in the early 1990s when RAM got cheap enough that you could afford two bytes per character.
Personally, I deal with data all the time and rarely encounter unicode. Of course, I'm in the US dealing with big files out of financial and marketing databases. In fact, I've seen more EBCDIC than UNICODE.
"Modern RAM did not exist until 1975 -- a decade later."
What does that even mean? It doesn't mean DIP packaged DRAM because my dad was buying COTS Intel 1103's in 1971 or so before I was even born. And the first "I'm gonna store one bit of data in a capacitor" was done over the pond in the .uk during WWII at their code breaking plant.
"like 64-bit IP addresses are the replacement for 32-bit IP addresses."
Joel Spolsky provides the necessary (in Joel's opinion of 2003) minimum background in "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
I think it's confusing the definition of "talk". Software written to handle UTF-8 can literally simply have 7-bit ASCII shoved through it. No translation necessary, no software modification necessary or anything.
On the other hand, software speaking UTF-32 isn't going to tolerate ASCII being shoved through it. The output is going to be 1/4 the length you expected and probably a mass of random Asian glyphs. You CAN write a shim that turns each 7-bit ASCII char into a 32-bit UTF-32 char and it'll work 1:1 perfectly for all 128 characters in ASCII. But outta the box, no, you have to write a shim.
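For what it's worth, this is easy to poke at. A minimal Python sketch, assuming an arbitrary sample string and explicit little-endian codecs (both are my choices for illustration, not anything from the thread):

```python
ascii_bytes = "Hello".encode("ascii")

# A UTF-8 decoder accepts 7-bit ASCII unchanged.
as_utf8 = ascii_bytes.decode("utf-8")

# A UTF-32 decoder reads 4 bytes per code point, so raw ASCII does not
# survive: here the bytes do not form valid UTF-32 at all.
try:
    ascii_bytes.decode("utf-32-le")
    utf32_accepts_ascii = True
except UnicodeDecodeError:
    utf32_accepts_ascii = False

# A UTF-16 decoder is the one that silently misreads ASCII:
# every two bytes become one unrelated character.
misread = b"ABCD".decode("utf-16-le")

# The "shim": widen each ASCII character to one 32-bit code unit.
shimmed = ascii_bytes.decode("ascii").encode("utf-32-le")
round_trip = shimmed.decode("utf-32-le")
```

In Python, at least, the UTF-32 decoder raises an error instead of producing glyphs, because four ASCII bytes read as one number land far above the highest code point. The silent-garbage case is what the UTF-16 decoder gives you.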
Now if you REALLY want to confuse people, after they 100% understand ASCII, UTF-8, and UTF-32, then feed some UTF-16 into something designed to eat either UTF-32 or ASCII or UTF-8. If you understand what a byte order mark is, and why it's important, and how to use it, then you're well on the way to understanding UTF-16.
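A small Python sketch of what the byte order mark buys you, using the standard `codecs` constants. The one-letter sample text is just an assumption for illustration:

```python
import codecs

text = "A"
le = text.encode("utf-16-le")   # low byte first
be = text.encode("utf-16-be")   # high byte first

# Without a BOM the decoder has to be told (or guess) the byte order.
# Guessing wrong still decodes, but to a different character.
wrong = le.decode("utf-16-be")

# With a BOM in front, the generic "utf-16" codec picks the byte order
# itself and strips the BOM from the result.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
```

The same two bytes mean "A" in one byte order and a CJK character in the other, which is why the mark matters.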
This was a common whine about unicode in the early days that all you've done is trade the agony of multiple extended ascii code pages for the agony of multiple formats to represent unicode... Should have just spec'd UTF-8 and no other encoding and been done with it... or maybe UTF-32.
I will say that a world where only UTF-32 exists would be a world with a lot more text compression of webserver responses and stuff like that. It wouldn't be an automatic end of the world.
Yes, but that's the distinction between character sets and encodings. A character set is just that, a set of characters. In this sense Unicode is a superset of ASCII, Latin 1, and every other pre-existing character set. Encodings map characters to byte( sequence)?s and they differ a lot in that respect.
Practically, both character sets and encodings provide ample opportunities to mangle input, either by replacing unavailable characters by ? or �, or by mapping to the wrong characters because the wrong encoding was assumed (or by throwing errors because the input made no sense).
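Each of those failure modes can be reproduced in a few lines of Python. This is only a sketch; the word "café" is an arbitrary sample:

```python
original = "café"

# Wrong encoding assumed: UTF-8 bytes read as Latin 1 give mojibake.
mojibake = original.encode("utf-8").decode("latin-1")

# Unavailable character: ASCII has no "é", so it is replaced by "?".
lossy = original.encode("ascii", errors="replace")

# Input that makes no sense: a lone Latin 1 byte 0xE9 is not valid UTF-8.
# Lenient decoding substitutes U+FFFD, strict decoding throws.
replaced = b"caf\xe9".decode("utf-8", errors="replace")
try:
    b"caf\xe9".decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```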
I mean when ASCII assigns a number to a character, Unicode assigns the same number to the character. UTF16 and UTF32 may encode those numbers differently from how ASCII encodes them, but they are nonetheless the same numbers.
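A quick way to check this claim in Python (a sketch; the codec names are Python's, and little-endian is picked arbitrarily):

```python
for number in range(128):
    char = bytes([number]).decode("ascii")
    # Same number in ASCII and in Unicode...
    assert ord(char) == number
    # ...UTF-8 even stores it as the same single byte,
    assert char.encode("utf-8") == bytes([number])
    # while UTF-16 and UTF-32 store the same number in wider units.
    assert int.from_bytes(char.encode("utf-16-le"), "little") == number
    assert int.from_bytes(char.encode("utf-32-le"), "little") == number

# Bytes needed for "A" under each encoding.
widths = {name: len("A".encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}
```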
Unicode is not an extension of EBCDIC or Latin1 because characters under them have different codepoints than they do under Unicode.
[Please correct me if I'm wrong. I'm not an expert but this is how it seems to me]
> [Please correct me if I'm wrong. I'm not an expert but this is how it seems to me]
You're basically right, but the subtle point is that these comments are in response to lotsofcows who said, "Anything that talks Unicode can talk ASCII (although not vice versa)."
The implication of "talk" is that the numbers are being encoded. qznc was trying to clarify that lotsofcows's statement is only true if the Unicode system is using an encoding which is backwards compatible with 7-bit ASCII, such as UTF-8.
"Unicode assigns the same number to the character."
Extremely close but not quite right for some weird corner cases. There's either a cool feature or hideous bug in the unicode design, depends how you look at it, that lets you compose multiple unicode codes together into one glyph. So you can say, "Gimme the code for A with a bar over it" or "Gimme the code for A, now gimme the code for put a bar over the last glyph". Even worse you can stack compose characters, so you can write "A with circle on top and squiggly underneath" at least five different ways in binary that should be rendered visually identical.
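To make the "five different ways" concrete, here is a sketch with Python's `unicodedata` module. The particular glyph (A with a ring above and a dot below) is my own pick for illustration:

```python
import unicodedata

# Five code point sequences for "A with ring above and dot below".
variants = [
    "A\u030a\u0323",   # A, combining ring above, combining dot below
    "A\u0323\u030a",   # same marks in the other order
    "\u00c5\u0323",    # precomposed A-with-ring, then dot below
    "\u1ea0\u030a",    # precomposed A-with-dot-below, then ring above
    "\u212b\u0323",    # ANGSTROM SIGN, then dot below
]

distinct_raw = len(set(variants))
distinct_nfd = len({unicodedata.normalize("NFD", v) for v in variants})
distinct_nfc = len({unicodedata.normalize("NFC", v) for v in variants})
```

All five are different strings as far as `==` is concerned, and all five collapse to a single string once normalized.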
There is the one true normalization technique that most people use to convert what they consider poor unicode grammar into a standard form. Some people don't use it of course. And anytime you give users freeform input who knows what kind of crud they'll feed you. So you can never really assume any unicode string is normalized unless you personally normalized it yourself. Even then theoretically two normalized strings, concatenated, might or might not be normalized anymore (although this is often not much of a problem). And this strikes substring manipulation too.
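The concatenation and substring cases look roughly like this in Python. The `is_nfc` helper is a hypothetical name of mine that simply compares a string against its normalized form:

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """True if normalizing to NFC would leave the string unchanged."""
    return unicodedata.normalize("NFC", s) == s

left = "e"          # already NFC
right = "\u0301"    # a lone combining acute accent, also already NFC
joined = left + right   # NFC would merge this into a single "é"

# Substrings have the mirror-image problem: slicing a decomposed
# string by code point can cut a mark off its base letter.
decomposed = unicodedata.normalize("NFD", "\u00e9")
first_code_point = decomposed[:1]
```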
A fun source of buffer overflows is that normalization can shrink OR EXPAND a unicode string, in theory. So if you use one of those languages without variable length strings, look out. Or if the language tries to take care of it for you, this leads to weird memory fragmentation. Maybe a DoS/DDoS opportunity?
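A sketch of the size change, measured in UTF-8 bytes. The `sizes` helper and the sample characters are assumptions of mine, chosen only to show both directions:

```python
import unicodedata

def sizes(s: str, form: str):
    """UTF-8 byte length before and after normalizing to the given form."""
    out = unicodedata.normalize(form, s)
    return len(s.encode("utf-8")), len(out.encode("utf-8"))

grow = sizes("\u00e9", "NFD")       # "é" becomes "e" + combining acute
shrink = sizes("e\u0301", "NFC")    # "e" + combining acute becomes "é"
wide = sizes("\ufdfa", "NFKC")      # one Arabic ligature becomes a whole phrase
```

The compatibility forms (NFKC/NFKD) are where the expansion gets dramatic, so a buffer sized for the input is not a safe bet for the output.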
Which leads to a philosophical argument you must decide in your code: if two bitstreams don't match, but you get matching glyph renderings, from the program's point of view are they a match or not? Depends on what your program is trying to do, I guess. Most languages have a library to handle this. Writing your own unicode handler functions is a "here be dragons" moment.
Composing characters are also a fun source of swear words if you're trying to count the number of characters in a file. Something that renders literally identically will have an identical number of characters but somewhat varying number of bytes.
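A rough illustration in Python. Skipping combining marks is only a crude stand-in for "characters a reader sees" (real grapheme cluster segmentation is more involved), and the `counts` helper is a hypothetical name:

```python
import unicodedata

def counts(s: str):
    """Return (visible characters, code points, UTF-8 bytes) for s."""
    code_points = len(s)
    utf8_bytes = len(s.encode("utf-8"))
    # Crude approximation: count everything that is not a combining mark.
    visible = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return visible, code_points, utf8_bytes

composed = counts("r\u00e9sum\u00e9")        # precomposed é
decomposed = counts("re\u0301sume\u0301")    # e + combining acute
```

Both strings render as the same six-letter word, yet the code point and byte counts differ.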
Other fun philosophical arguments are if you read in an un-normalized string and output a normalized string, have you changed the string? Well, both no, and yes. I'm sure there's crypto steganography implications.
Google for "unicode normalization form nfc" and stuff like that.
Someone below noted already that grapheme clusters are, at least for dealing with humans, a much better thing to talk about than bytes or abstract characters (code points). The statement "Unicode assigns the same number to the character." was indeed correct, though. It's just that character is such an overloaded word (meaning either byte, code point or grapheme); but for Unicode at least it is synonymous with code point.
The ASCII coded character set is a subset of Unicode, which is not true for arbitrary codesets, even if their character repertoire is a subset of the Unicode one.
But character sets are not just sets of characters, they are assignments of abstract glyphs to integers, regardless of the storage representation of the integer.
Eep, funny how I can make the same mistake twice in ~five years. You are right, of course.
Character sets therefore could be modelled as a partial function c : Character ↛ ℕ, while the encoding is a partial function e : ℕ ↛ seq Byte. Something like that.
ASCII and Latin 1 are therefore both subsets of Unicode, even though only ASCII is a subset of UTF-8 while no comparable Unicode encoding exists for Latin 1 or any other encoding.
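The two levels can be checked separately in Python. This is a sketch that relies on Python's `latin-1` codec for ISO 8859-1:

```python
# Character set level: every Latin 1 byte value is the same number in Unicode.
same_numbers = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

# Encoding level: UTF-8 agrees with ASCII byte for byte...
ascii_is_utf8 = all(chr(n).encode("utf-8") == bytes([n]) for n in range(128))

# ...but not with Latin 1 above 0x7F.
e_acute_latin1 = "é".encode("latin-1")
e_acute_utf8 = "é".encode("utf-8")
```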
I've worked on automated data submissions for banks in the UK, and the insistence on fixed-width, EBCDIC encoded data files for many regulatory filings (FSA, credit rating agencies) was annoying. On the other hand, it was so easy to automate in a VBA macro that I could have quite a bit of free time.
I really hate to nitpick, but the article implies that ASCII was the first character encoding. In fact, there was a rich history of different encodings before that, with different word sizes and/or incompatible 8 bit encodings. It's quite interesting to look back and see what trade-offs were made and why.
Well, it's the oldest character set and encoding that's still semi-relevant today. I doubt many people nowadays encounter EBCDIC and the like (and if they do, the article isn't aimed at them, I guess).
For all practical purposes EBCDIC is the "IBM Standard" as opposed to ASCII's "American Standard". The mood in the 70s/80s outside the business world was "In your face IBM!"
And putting "Information Interchange" in the name itself is another "In Your Face" posturing to the mainframe world. We're the future of data transmission and you'd best get used to it, IBM...
ASCII really was a rebellion in the olden times. One that won.
Another story: before ASCII, teletype codes and such were usually modal, with LTRS/FIGS to switch from 5-bit letters to 5-bit numbers. So there's that dramatic great circular wheel of IT where we've oscillated both before and after ASCII between simple encodings and modal encodings. This was an early whine against unicode: who cares about codepages, just embed it like, or in, what amounts to the MIME media type, and glyph-like Asian languages should just be drawn in gif files anyway. Or so the complaints went at that time.
Another design statement story: Kind of like the Uni- in unicode uniting all the extended code pages into one really huge space.
There are other dramatic stories not in the article. For example the Klingon in Unicode movement. Basically about 15 years ago they tried to get Klingon script into unicode, about a decade ago the unicode people (who?) said no, so the Klingon people squatted in the unicode equivalent of what networking people would call RFC1918 space, and it's simmered since then. Will Klingon actually make it into unicode officially or not, who knows. You can add to the fire by pointing out that Unicode is already stuffed with scripts that no living human culture currently uses, and numerous glyph symbols (think of math stuff like + or %). On the other hand this would inevitably result in Tolkien Elvish script being included in unicode. And does this really matter one way or the other? And this maps perfectly into the wikipedia battle between the deletionist (expletive deleted) and the inclusionist saviors of humanity. Well maybe that last line was a little biased to my opinions...
There's tons of extra fun drama to tell about unicode and cutting off the story at the founding of ASCII misses some of the fun drama. Aside from some fun drama that wasn't mentioned in a post aimed at non-techs who we're told really like drama. You could probably turn the Unicode story into a trashy reality TV show somehow; Vampire Romance Fiction is going to be a harder translation although I'd love to see it.
IBM was actually quite involved in the development of ASCII. See Charles E. MacKenzie, Coded Character Sets: History and Development [1] and Bob Bemer's stuff [2].
True. Much as they were deeply involved in PCs. But the mainframe people and the desktop people identified pretty strongly with their respective character sets in the old days.
There is no technological reason EBCDIC couldn't have been the encoding for the whole desktop revolution, other than the dramatic central control vs local control, mainframe vs desktop thing.
There is some truth to the claim that whenever IBM reached a fork in the road they usually found a way to go both ways, at least for many years in the olden days.
But for historical encodings that kind of matters. Because a byte was not always 8 bit we have to deal with utf-7 and had to ensure things were 8 bit clean. And utf-9.
The fact that UTF-8 and UTF-16 are often exposed to programmers when dealing with text is a major failure of separation-of-concerns. If you had a stream of data that was gzipped, would it ever make sense to look at the bytes in the data stream before decompressing it? Variable-length text encodings are the same. Application code should only see Unicode code points.
In general it was a mistake to put variable-length encodings into the Unicode standard. A much better design would have been to use UTF-32 for the application-level interface to characters, and use a separate compression standard that is optimized for fixed alphabets when transporting or storing text. This has the advantage that the compression scheme can be dynamically updated to match the letter frequencies in the real-world text, and it logically separates the ideas of encoding and compression so that the compression container is easier to swap out. And, of course, an entire class of bugs would be eliminated from application code.
Edited first paragraph to clarify: Variable-length text encodings are the same.
I agree that that is the ideal end situation, but Unicode would have been dead on delivery if they had chosen that approach. Memory just was too expensive at the time to make a system that, in most of the computer-using world, wasted 75% in every text string. And no, just-in-time decompression wouldn't have worked either; CPU cycles also were too expensive at the time to do that.
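The 75% figure is just arithmetic for text that fits in ASCII. A sketch, with an arbitrary sample sentence:

```python
text = "The quick brown fox jumps over the lazy dog"

utf8_size = len(text.encode("utf-8"))
utf32_size = len(text.encode("utf-32-le"))  # explicit byte order, so no BOM is added

# Share of the UTF-32 bytes that carry no information for this text.
wasted_fraction = 1 - utf8_size / utf32_size
```

For pure ASCII text every character takes four bytes instead of one, so three out of four bytes are zero padding.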
That makes it sound as if UTF-32 would be a silver bullet - it's not. Application code normally has to deal with user-perceived characters/glyphs/grapheme clusters, which means you'll have to treat UTF-32 as variable-length as well.
If you want to get back the supposed benefits of UTF-32, you'll have to dynamically assign codepoints to grapheme clusters.
I'm impressed. Easily readable and understandable, short and as far as I can tell no factual inaccuracies or wrong information (unlike many other Unicode introductions and tutorials).
You want some criticism, because we have too little of that here on HN? I'll bite. ;)
"A byte is a set of 8 bits. Computers typically move data around a byte at a time."
A byte being 8 bits is ok. Historically, a byte might have a different number of bits, but all modern architectures use 8 bits. Since this is an introductory article, this is fine. (more details: http://en.wikipedia.org/wiki/Byte)
Computers do not typically move data in byte chunks. You could say "a byte is the smallest unit of data a CPU can load or store". If you talk about moving data, the question is between what. Probably memory. However, there are caches nowadays, since bandwidth is cheap and latency is expensive. Data is moved in cache line chunks, which means 4-64 byte chunks depending on architecture and cache level. Bigger chunks in upcoming architectures.
I wouldn't say "all modern architectures use 8 bits". Plenty of DSPs have > 8-bits (SHARCs are 32-bit, ZSPs are 16, 56K-like DSPs are 24).
A byte is the smallest thing that a pointer can point at. Something like MIPS (from memory) has 8-bit bytes, but a load instruction loads 4-bytes at a time (and although it has the unaligned load functions which can be convinced to load a single byte, they aren't general purpose)
There's some subjectivity in where you consider it standardized, but I would put it later than that. Throughout the period when IBM's System/360 was popular, DEC's PDP series was also popular, which used 36-bit words and a somewhat varying definition of what constituted a byte.
If I had to pick a date, I'd pick 1977 as when it solidified. A few elements: ARPAnet standardized on octet-based protocols in the early 1970s, and was getting widespread by the mid 1970s; three popular microcomputers based on 8-bit architectures were introduced in 1977 (the Apple II, TRS-80, and Commodore PET); and DEC introduced the 32-bit VAX in 1977.
I think this is a great explanation for anyone who's approaching the subject for the first time. It gives a good introduction as to why you're staring at broken data coming from a db and just how royally screwed your afternoon is going to be getting it back into shape :)
It's worse than that, actually, as ASCII from the start¹ included provisions for variants for non-English latin characters and alternate currency symbols, and ASCII was essentially the same project as ECMA-6² (ECMA being the European Computer Manufacturers' Association³, a standardization group founded in 1961).
ASCII as we know it (which is essentially the 1967 version⁴) like the corresponding ECMA standard⁵ provided for overloading punctuation characters as diacritics ("/¨ ^/ˆ ~/˜ '/´ ‘/` ,/¸) to be overstruck in typewriter fashion; ECMA-35⁶ (1971⁷) defines further extension techniques using control and/or escape sequences.
So, yes, it's just a failed attempt at an anti-American cheap shot from someone who isn't familiar with the development of character set encodings.
I'm guessing it depends on the language. In spanish we have only one type of accent and only in vowels: áéíóú, and the ñ.
The ñ is right next to the "L" key, and the tilde (accent) is either right next to the ñ or on top of it.
This adds some sort of complexity, but in my experience, the average user simply expects the key to be "where it always was" and I've had to "fix the problem" (changing the input method) many many times.
To answer your question I think that the number of users that actually take note of what happened and understand it enough to fix it again is minimal, the rest just expect it to work and ask for help when it doesn't.
There are also a lot of variations of spanish keyboards, so it just makes the matter more complicated... I use *unix, Windows and OSx almost interchangeably and know how to change the input language in most of them to ISO spanish spanish quite quickly, but I'm not representative in that regard.
Slightly offtopic but the spanish spanish keyboard layout is extremely comfortable for programming...
"Designed as a single, global replacement for localised character sets, the Unicode standard is beautiful in its simplicity. In essence: collect all the characters in all the scripts known to humanity and number them in one single, canonical list. If new characters are invented or discovered, no problem, just add them to the list. The list isn’t an 8-bit list, or a 16-bit list, it’s just a list, with no limit on its length."
Is this really true? My impression was that UTF-32 is a fixed-length encoding which uses 32 bits to encode all of Unicode. It seems that this means that Unicode can never have more code points than could fit in 32 bits. Right?
Unicode can never have more code points than could fit into 21 bits. Because Unicode is a 21-bit code. This has historical reasons because Unicode was a 16-bit code initially and it soon became apparent that 65536 characters are not enough, especially with the commitment to
a) compatibility with every pre-existing character set and
b) including historical scripts too¹
21 bits was what emerged from expanding UCS-2 to UTF-16 via surrogate pairs, and Unicode was reörganised into 17 “planes”, the first of which, the BMP, contains all code points allocated so far. UTF-32 then just was a simple encoding scheme that allows one code point per code unit and is also efficient to process. 21- or 24-bit code units would be unwieldy on most architectures (especially regarding unaligned memory access).
___________
¹ Arguably the decision to include Emoji made a bigger dent in the code point space than hieroglyphs, Linear [AB], etc., though, but that came a little later.
Unicode code points go up to 0x10FFFF, which fits in 21 bits.
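The numbers are easy to verify in Python. A sketch; `MAX_CODE_POINT` is just a name I'm giving the constant:

```python
MAX_CODE_POINT = 0x10FFFF

bits_needed = MAX_CODE_POINT.bit_length()
planes = (MAX_CODE_POINT + 1) // 0x10000   # 65536 code points per plane

last = chr(MAX_CODE_POINT)                 # the highest code point is accepted
try:
    chr(MAX_CODE_POINT + 1)                # one past the end is not
    beyond_ok = True
except ValueError:
    beyond_ok = False
```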
If this were ever to become a problem (which I don't see happening any time soon), the transition from UCS-2 to UTF-16 is prior art on how to pull off the extension of a coding space.
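For reference, the surrogate pair trick that extended the 16-bit space looks roughly like this. It is a sketch of the UTF-16 arithmetic; `to_surrogates` is a hypothetical helper name, and the result is checked against Python's own encoder:

```python
def to_surrogates(code_point: int):
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
expected = "\U0001F600".encode("utf-16-be")
```

Two reserved 10-bit ranges give 2^20 extra code points on top of the original 2^16, which is where the odd-looking 0x10FFFF ceiling comes from.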
Somewhat unrelated, but nevertheless worth mentioning as it's a common misconception: While UTF-32 is a fixed-length coding for Unicode characters, often, the more interesting unit is the grapheme cluster, effectively making UTF-32 into a variable-length coding.
Think you're right. The article could do with a rewrite as Unicode used to be 16 bit until 1996 (according to Wikipedia) which explains why Java/Windows are really UTF-16 based.
The more interesting question is if you're designing a new operating system would you pick UTF-8 or UTF-32 as the basis of your character system. Bearing in mind you need to normalise strings anyway for comparison purposes the general space efficiencies for UTF-8 for most systems seem tempting.
I believe it depends on the encoding - UTF-32 has a limit of 2^32 characters, but UTF-8 could potentially expand infinitely (ignoring the current arbitrary 4-byte limit).
That's incorrect - the UTF-8 scheme encodes the sequence length in the first byte as the number of bits set to 1 until the first 0.
The original limit was 6 bytes encoding 31 bits, but this was lowered to 4 bytes encoding 21 bits. If I'm not mistaken, 7 bytes encoding 36 bits (or 8 bytes encoding 42 bits if you make the 0 optional) should be the hard limit.
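Those bit counts can be reproduced with a little arithmetic. This is a sketch of the length-prefix scheme as described here, not a full UTF-8 decoder, and both function names are hypothetical:

```python
def sequence_length(first_byte: int) -> int:
    """Sequence length = number of leading 1 bits in the first byte."""
    ones = 0
    mask = 0x80
    while mask and first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 1:
        raise ValueError("0b10xxxxxx is a continuation byte, not a first byte")
    return 1 if ones == 0 else ones

def payload_bits(length: int) -> int:
    """Bits left for the code point in a sequence of the given length."""
    if length == 1:
        return 7                      # 0xxxxxxx
    first = max(0, 7 - length)        # first byte: length ones, a zero, then payload
    return first + 6 * (length - 1)   # each continuation byte carries 6 bits

euro_length = sequence_length("€".encode("utf-8")[0])
bits = {n: payload_bits(n) for n in (4, 6, 7, 8)}
```

The 7-byte and 8-byte cases are hypothetical extensions of the pattern; no version of UTF-8 has actually defined them.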
In your scheme, we'll end up with multiple leading bytes, though.
Also, how do you plan to handle 9 leading bits instead of the 10 from your example? That would make the second byte start with 10, the marker for continuation bytes.
Ouch, you're right. Refreshed my memory of UTF-8 encoding and now it seems it's designed so it can't be made to express arbitrary values. Not sure why I imagined the opposite. My mistake.
The current limit (0x10FFFF) is the artificial restriction on UTF-8 to match UTF-16, not the upper bound. UTF-8 supports 2^31 code points with its current encoding. So nothing needs to change until Unicode consists of 2.1 billion code points or so.
In my mind that will only happen as a result of extreme carelessness.
> These mappings of numbers to characters are just a convention that someone decided on when ASCII was developed in the 1960s. There’s nothing fundamental that dictates that a capital A has to be character number 65, that’s just the number they chose back in the day.
I don't think it's mere coincidence that the capital letters start at 65 and the lower case at 97 and the decimal digits at 48.
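The pattern shows up once you look at the bits. A small Python sketch of what those starting points make possible:

```python
# Upper and lower case differ only in one bit (0x20).
case_bit = ord("a") ^ ord("A")

# The low 4 bits of a digit character are its numeric value.
digit_values = [ord(d) & 0x0F for d in "0123456789"]

# The low 5 bits of a letter are its position in the alphabet.
letter_positions = [ord(c) & 0x1F for c in "AZaz"]
```

So case conversion is a single bit flip and digit-to-number conversion is a mask, which would have been attractive on the hardware of the day.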
It's not a matter of winning or losing. The pre-unicode mix of character sets was a mess when it came to internationalization. Try truncating a Japanese Shift-JIS string in C. That will learn you..
Arguably Unicode (UTF-8 and -16) doesn't necessarily make this any easier. Or any variable-length encoding, really. You see halved code points quite frequently, and if not that, then halved `&quot;` and the like.
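For UTF-8 at least, a halved code point is avoidable, because continuation bytes are recognizable on their own. A sketch in Python; `truncate_utf8` is a hypothetical helper and it assumes the input is valid UTF-8 to begin with:

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut valid UTF-8 to at most `limit` bytes without splitting a code point."""
    if len(data) <= limit:
        return data
    end = limit
    # If the first byte we would drop is a continuation byte (0b10xxxxxx),
    # the cut is in the middle of a character: back up to its first byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

encoded = "日本語".encode("utf-8")   # 3 bytes per character

try:
    encoded[:4].decode("utf-8")      # naive cut lands mid-character
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

safe = truncate_utf8(encoded, 4).decode("utf-8")
```

This only protects code points. It can still cut between a base letter and its combining mark, so it does not solve the grapheme problem.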
OT and out of curiosity...how do non-native English speakers experience typing/keyboard education? I can barely remember how to make any of the basic accents over the `e` when trying to sound French...are typing classes in non-English schooling systems much more sophisticated than in English (i.e. ASCII-centric) schools? I wonder if non-native English typists come away with a better handling of the power of keyboard shortcuts (whether to create accents or not)
In the Netherlands we use the "US International" keyboard, which is basically qwerty like you're used to but with dead keys to make áéàüetc. We don't have that many odd characters so it's okay not to have special keys for it, but I'd agree with anyone saying that even these few are too many. If we're gonna clean the language up though, let's also get rid of other overhead. For example the sentence "If we clean language though, rid other overhead" makes quite a lot of sense.
Other countries that make even less sense and use way too many special characters (from my blunt perspective) usually have different keyboards. Prime example that I know of is France where you have to use the shift key to make numbers.
To answer what you actually asked: Yes we are taught how to make those special characters. For me it's as logical as typing parentheses or the euro sign, but I'm spending most of my waking hours typing one thing or another. Many people don't or barely know how to.
It's not what most of the time is spent on in those courses. We're being taught the asdfjkl; row just like everyone else. Or aoeuhtns, depending on your keyboard (in the Netherlands we only have qwerty though). Making accents is more of a side thing that's mentioned once or twice after learning everything else at an acceptable speed.
In Korea, almost every keyboard is a plain QWERTY keyboard with Hangul printed on the keys as well.[0]
The right Alt key is used as the Hangul (Korean)/English toggle, and the right Ctrl key as the Hanja key.
When toggled to Hangul, only the English letters are replaced by Hangul characters. All numbers and symbols stay the same as in English typing mode.
Basically there are no additional keys compared to QWERTY.
Is it complex and hard to learn to type in Hangul? Nope. Maybe 'Korean' is complex to learn, but 'Hangul' (the script, or character composition system, or something like that) is quite simple.[2]
Actually, it allows a more efficient input layout than English, especially in more restricted environments such as the basic cell phone key layout (E.161).[1]
There was a King, and he was a really great hacker. Because he was a King, he grabbed a bunch of smart guys from all around the country. ; ) Then he pushed them to work hard. (Did I say he was a King?) So they invented many good things for the country's people. Today Korean has its own quite good and expressive characters, and he deserves quite a good place.[3]
Chinese has many different input methods. For example, in Taiwan most keyboards have the symbols for 4 different input methods printed on them: English/Pinyin on the top left, Zhuyin on the top right, Cangjie on the lower left, and Dayi on the lower right. The top two are phonetic, and the bottom two are symbolic.
Chinese input methods typically require a sequence of key-presses and then a selection from a menu of matching characters, with the most common matches first. Multi-character sequences can be entered without making a choice until the end, in which case the most likely n-grams come first.
Most people in Taiwan learn phonetic (Zhuyin) in school, which is very easy, as long as you know standard pronunciation.
Japanese often type in "Romaji" (a romanized transliteration of Japanese). Basically, they type phonetically in roman characters and the computer suggests/autocompletes characters which match the phonetics. It's a little like typing SMSs in T9 on old numberpad phones.
I switch (with a key combination) between a French keyboard, when I need to write text in French with accents, and an English keyboard when I need to type various types of brackets e.g. [] and {}. I actually know two different French keyboard layouts ... and can touch type fairly efficiently using any of the three. I don't consider myself a very efficient user of keyboard shortcuts.
Well, when 99% think unicode = encoding = ucs2 = utf-16, don't believe there's something outside BMP, and wtf is the only word coming to their mind when they hear about graphemes… Unicode won?