I feel so sorry for Arabs now. I just read that paragraph about the everyday experience of trying to write mixed English-Arabic text in a mail client or any other editor.
> I have watched senior engineers, fluent in both Arabic and English, give up on writing a long email in Outlook on a Wednesday afternoon because the cursor would not behave, and switch to Arabic-only or English-only because the cognitive cost of fighting the editor exceeded the cost of monolingual phrasing. Actually I remember very well suffering this while using Facebook for the first time in my life, and I could not register; I was very slow typer that when I reached the moment the cursor does this weird thing, I would just stare at it and never progress.
> This is the ordinary experience of writing mixed Arabic-English text in 2026, in every major editor, email client, and chat application I know of. The pettier cousins are everywhere too, and I collect them: a range like 10–20 silently reading as twenty-to-ten, because digits are weak and the dash is neutral; a trailing exclamation mark teleporting to the far end of the line; a password, toggled visible, displaying in an order that does not match what was typed. None of these are anyone's bug, exactly.
My own Cyrillic struggles are nothing in comparison.
I was lucky enough to date someone from the Gulf, so I forced myself to understand and read Arabic scripts (but not understand the text), which is a huge bonus when creating multi-lingual designs.
I already had a good understanding of CJK scripts, and you'll come across RtL there, with things like tategaki, which is both vertical and RtL at the same time (and can include quotes in other languages such as English and Arabic). Here's some lyrics I made in that format for reference:
The complexities of mixed LR and RL text are quite astonishing since it’s not really even a case of just switching modes when switching scripts since double-nested (or more) texts can change the semantics of line breaks. This article provides a good overview: https://tug.org/TUGboat/tb08-1/tb17knutmix.pdf¹
In college,² when I wanted to quote some texts from Exodus in Hebrew in a paper that I wrote, I ended up avoiding the issue by hand-reversing the letter order and manually breaking lines. 8 bits is insufficient to cover all the possible combinations of letters and vowel markings so the font didn’t include any vowel markings and only did dageshim for בּ and פּ if I recall correctly.
⸻
1. As an aside, it would have been really nice if Unicode provided an R-L mirrored Latin alphabet to make it easier for monolingual developers to grasp the complexities surrounding mixed directional typesetting. I suppose it could still be added, although Unicode tends towards conservatism on adding characters.
2. This was 1990, well before Unicode in the era of a hundred or so 8-bit character encodings, most of which were not implemented widely. I also had to type the text using the arbitrary ASCII-Hebrew mapping of the font I was using which, among other things, led me to discover that letter frequency in Hebrew is much more uniform than it is in English.
One thing I sometimes think about when I think about text layout problems is how the text we use also has a bunch of complexities that we can take for granted.
Think of variable-width characters and kerning and ligatures and hyphenation and justification. Imagine computing had been dominated by a CJK language, which has none of these problems. You could imagine a similar article about how exotic and difficult English layout is.
Both Latin and Chinese have been modified by the technology used to write them.
When carved in stone the lines are much straighter. When written with brush or pen they became semi-cursive. When printing was introduced, they became grid-like and regular.
What westerners who are passingly familiar would think of as the standard Chinese typeface - the strict square grid with straight-line characters - arises in part from printing technology. Easy to carve that into wood blocks, and easy to line up the slots into a grid.
Latin was similarly morphed to fit into the realities of printing in the 1500s. And is still being morphed. Notice how numbers 123... are in-line and at the same height as the letters. That's a very modern convention, typewriter and computer influence on our orthography. Traditionally digits were more likely to appear as subscript, off-centre.
what selective pressures against oldstyle numerals with ascenders/descenders existed that wouldn't have equally applied to letterforms with those same features?
(aha i have found the answer to my own question: miniaturization for fractions in phototypesetting)
the other part is that numbers and symbols were very much not the priority. The printing press was for books, magazines etc. math remained hand written until the computer
Nope, not at all. Monotype had a special system for doing math in hot metal typesetting. With handset type it was possible, but very time-consuming. You can find typeset mathematics going back centuries before the computer. There were also (somewhat impractical) systems for setting music with metal type although engraving was more common because of the interactions of lines and symbols.
Not really. The selective pressure really comes well before that: Tabular presentation of numbers, whether that was log/trig tables or railroad time tables, there was a preference for uniform-width and regular height characters for those contexts (this is also why there is a number-width parameter in TT typography to enable a designer to let digits be variable-width in text but still allow tabular setting if desired).
Conversely, English has a joined form (cursive) that is nearly dead because mechanical text assistance devices (first typewriters, now computers) work much better with the block form. While sad in a cultural-loss sort of way, the joined form only really makes sense when the text is handwritten.
I am not familiar with the history of Arabic typography, but I sort of assume there was an archaic block form and that the current joined form is the result of many centuries of encoding handwriting practice, advanced enough that falling back to a block form is impossible, with the side effect of making simple mechanical text formatting impossible too.
As for Chinese-derived characters, we are currently able to jam them awkwardly into our alphabet-optimized structures (one code per character), but I wonder if a Chinese-native encoding would look different. Would it make sense to try to represent the sub-characters present in each Chinese character in the encoding? I suspect not. Chinese works, but it also does not appear amenable to simple mechanical assistance.
Another wrinkle with Arabic is linguistic conservatism. Due to Islamism and the idea that Arabic is the language of God (the Quran was written in Arabic by the supposedly illiterate prophet), Arabic has lagged behind other languages in terms of innovation.
Hebrew is a closely related Semitic language that simply adopted a block and a cursive form. It has also been greatly simplified and is friendlier towards loanwords, which has made it far easier to learn.
Muslims don’t believe Arabic is the language of God. They believe that the Quran was revealed in Arabic (true). Thinking the creator of the heavens and earth only speaks one language is absurd. It also kind of implies that Muslims believe in a superiority of Arabs which is also not true.
Weird to say Arabic hasn’t innovated or evolved considering the wild variety of dialects spoken in the modern world.
Conflating the language with the script is also bizarre. In terms of adapting Arabic to technology, look into romanized Arabic which was used before Unicode was common.
I didn't write "God only speaks Arabic" in Islam. That's your interpretation of my post. All I meant was that Arabic has special status in Islam.
> Weird to say Arabic hasn’t innovated or evolved considering the wild variety of dialects spoken in the modern world.
I didn't say Arabic has not innovated or evolved; only that it "has lagged behind other languages in terms of innovation". My belief is that that is due to linguistic conservatism, and linked to Islamism (or, at minimum, the centrality of Islam in Arab culture). Also related to this is the existence of Fusha, its place in Arab culture, and its branding as "modern standard Arabic".
I didn't conflate anything. While a script and a language are not the same, it's not a coincidence that Arabic is often written today in a script that is very close to Quranic script. And -- to really kick the hornet's nest -- it's also not a coincidence that there have been so few outstanding Arab writers (in Arabic) in the past 100 years. One novelist and a couple poets.
There's the https://en.wikipedia.org/wiki/Ideographic_Description_Charac... that kind of does that. The problem is that there's character divergence (see all the brouhaha about Unicode Han unification), so there needs to be something else to select variants too.
As a reference, I don't believe any of the pre-Unicode CJK&c encodings attempted that.
Looking at dictionaries and printing presses from China before the invention of computers reveals that they probably would have done something similar to ascii, just with more bits to encompass all the characters.
There is no unjoined form of Arabic. The Arabic script became Arabic when Nabataean script started developing joined letter forms. Unjoined Nabataean is as foreign to Arabic as Phoenician is to Greek.
This explains a lot of why I think Arabic looks so beautiful on the page. I also love the way it can sound when spoken. Shame I don't have a good enough reason to take the years I would need to learn it.
Learning language is like learning poetry. Your appreciation of the thing grows proportionally with the practice. Which means you don't need to learn to fluency to appreciate Arabic. Even modest practice, just learning to write the alphabet, will increase your appreciation for the language if that's what you'd like to do.
I could write all day about how badly RtL languages are treated on electronic devices. Most interfaces just slop RtL text into an interface that is otherwise the totally wrong way around.
One thing that amuses me is that people share these "safe zone" templates for short form video to make sure your content isn't hidden behind the buttons:
I actually like that they depend on position, it even carries semantic information when reading because of how conjugation works so it actually helps you read faster
> The relevant rule, W2 of UAX #9, reclassifies a digit as an ARABIC NUMBER if any of the previous strong characters in the paragraph were Arabic letters, and as a EUROPEAN NUMBER otherwise. Both render their internal digits left-to-right, which is correct: numbers everywhere on Earth are read most-significant-first.
Does the author mean most-significant-on-the-left? The statement as written is a statement about the order in which one reads or perhaps thinks the number, whereas I think the author is discussing how numbers, including collections of numbers delimited by hyphens and such, should be laid out on the page.
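For anyone who wants to poke at the rule quoted above, here is a minimal Python sketch of W2 on its own. It relies on `unicodedata.bidirectional` from the standard library for the character classes; the function name `resolve_w2` and its `sos` parameter are made up for this illustration. As I read UAX #9, the rule searches backward from each European number to the first strong type and only reclassifies when that type is AL (Arabic letter). The sketch ignores every other rule (explicit embeddings, isolates, the remaining weak and neutral rules, reordering), so treat it as a toy, not a conforming bidi implementation.

```python
import unicodedata

# Strong bidi classes: left-to-right, right-to-left, Arabic letter.
STRONG = {"L", "R", "AL"}

def resolve_w2(text, sos="L"):
    """Return (char, bidi_class) pairs after applying only rule W2.

    `sos` stands in for the start-of-sequence type, which the real
    algorithm derives from embedding levels; here it is an assumption
    passed in by the caller.
    """
    last_strong = sos
    resolved = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG:
            last_strong = cls
        elif cls == "EN" and last_strong == "AL":
            # W2: a European number whose nearest preceding strong
            # character is an Arabic letter becomes an Arabic number.
            cls = "AN"
        resolved.append((ch, cls))
    return resolved

if __name__ == "__main__":
    # Latin context, Arabic context, Hebrew context.
    for sample in ("abc 10", "\u0633\u0639\u0631 10", "\u05d0 10"):
        print([cls for _, cls in resolve_w2(sample)])
```

In this sketch digits after Arabic letters come out as AN, while digits after Latin or Hebrew letters stay EN, since W2 looks specifically for AL rather than any right-to-left character. Either way the digits inside one number keep their left-to-right layout; what changes is how neighbouring separators and neutrals get resolved later.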
This is a thing I have wondered about: are Arabic numbers little-endian when embedded in Arabic text? The text is read right to left, but the numbers are laid out the same way as in our left-to-right text, where they are big-endian (read big to small). So are numbers in original Arabic little-endian, read small to large? That would have the interesting side effect that numbers are universal.
I think least significant first is the most natural order for written numbers (and it makes addition easier.)
I had always assumed that was what was intended with Arabic numbers, only silly Europeans made a mistake when they borrowed the positional system and forgot Arabic is written the other way. (Or perhaps intentionally avoided mirroring the digits for ease of communication?)
But the author of this article makes it sound like even in Arabic, numbers are read out loud most significant first.
I think he's talking about the rendering algorithm with regards to the stream of text. essentially saying rendering direction should follow reading convention.
on the other hand, in formal arabic, it's not unusual that numbers are read in clusters from least significant to most significant (right to left). 1984 would be read: eighty four and nine hundred and a thousand. not sure if the author is aware of this
> rendering direction should follow reading convention.
What does that even mean in this context? In a strictly LTR language, sure, you read left-to-right and the glyphs are rendered left-to-right. But the whole discussion is about bidirectional text, where the text is rendered by a complex algorithm. What is the “rendering direction”?
I know just enough about some RTL languages to know that one can absolutely intersperse RTL text with, say, an English phrase, and you still read the first (leftmost in the group) English sound first and so on :)
if you write (with a pen) text with mixed arabic, numbers, english words, and somebody else gets to read it, he will read arabic from right to left, then when he encounters a number or an english word, he will naturally jump to the first latin letter of the word and start reading left to right, then jump back to the beginning of the next arabic word and switch back to RTL. the algorithm should copy that behaviour.
This article is wonderful. It's interesting, it's captivating, full of detail, and to think I never gave much thought to Arabic rendering before.
This part nearly had me chuckle audibly:
> He says yes. The result is "Simplified Arabic": initial fused into medial, final into isolated, ligatures dropped. It conquers the Arab newsroom in a generation. Mrowa is assassinated at his desk eight years later, by an unrelated faction, in an unrelated dispute.
Also, it's depressing how hundreds of millions of people couldn't even get their language typeset on a computer, and our industry meanwhile was busy building AI-native AI for your groceries (have we mentioned it has AI btw?) and similar performative bullshit.
> Also, it's depressing how hundreds of millions of people couldn't even get their language typeset on a computer, and our industry meanwhile was busy building AI-native AI for your groceries (have we mentioned it has AI btw?) and similar performative bullshit.
AI also brought you this "wonderful" article, I would note.
AI can probably solve "how do we get good shaper implementations into software", provided a good enough spec and test suite are available, but it won't solve "how do we convince stakeholders to value supporting languages spoken by massive numbers of people", most likely
Arabic script is a great test to see if your terminal/renderer/UI can handle anything: contextual shaping, cursive connectivity, bidirectional text layout, diacritics and vertical displacement.
I went down this rabbit hole a while back and it made me really appreciate the complexity of the script.
Internet Explorer 5.5 implements text-justify: kashida. For one brief, weird browser-quarter Microsoft is the only software vendor on earth that can justify Arabic correctly on a screen.
Very interesting. I just implemented a text shaper and renderer from scratch with support for complex scripts like Arabic, Nastaliq and Indic (will soon post about it here on HN). Now that you write about it, the lack of stretching really is a deficiency in the OpenType spec.
If you want a solution for this it has to happen in the rendering step, not the shaping (which is HarfBuzz's main task). The shaper has no information about the available space, but when rendering you could stretch individual glyphs to the desired width, similar to adjusting the width of whitespace in Latin, but more complex, because you actually have to modify the glyphs with a scale transform. I am not an expert on Arabic script by any means, but this should be possible IMO. It would at least be an interesting experiment. Of course the JSTF table would be the right way to do it, but there seems to be a lot of confusion around it. Maybe in the age of LLMs we can give it another shot.
It does seem like a Knuth-Plass-like (KP-like) algorithm ought to be able to optimize the break positions without extreme algorithmic difficulty, aside from the inputs being considerably more complex than for Latin block print: the cost function for a proposed line is a straightforwardly [0] calculable function of the contents of the line, and I think one could make a dynamic programming algorithm that tracks, for each input position, the cost of the optimal layout of all text up to that position with a break at that position. This gives an algorithm that takes cubic time. (For input length n, you need to fill in n values in the table. Each value scans the entire table before that position and does a calculation with complexity linear in the proposed line length.)
As a practical matter, there’s an input length n and there is some upper bound B on a credible line length as measured in code points, so there are only at most n*B credible proposed lines to evaluate, which also limits the useful look back on the table to B positions, so I think the time complexity could be reduced to O(n*B^2) without making the results worse on reasonable inputs, and this is probably quite tolerable.
[0] Straightforward once you’ve implemented the whole Arabic rendering stack, anyway. I am certainly not qualified to calculate this function :)
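To make the dynamic programming idea above concrete, here is a rough Python sketch of the bounded look-back version. Everything specific in it is an assumption for illustration: `break_lines`, `line_cost` and `max_items` are made-up names, the input is a flat list of item widths rather than shaped Arabic runs, and the cost is plain squared leftover space standing in for whatever a real kashida-aware cost function would compute.

```python
INF = float("inf")

def line_cost(widths, line_width, is_last):
    """Hypothetical stand-in cost: squared slack, infinite if overfull.

    Takes time linear in the number of items on the proposed line.
    """
    used = sum(widths)
    if used > line_width:
        return INF
    if is_last:
        return 0.0  # the final line is allowed to be short
    return float((line_width - used) ** 2)

def break_lines(widths, line_width, max_items):
    """Return the end index (exclusive) of each line in an optimal layout.

    best[i] is the cost of the best layout of items [0, i) that ends with
    a break at i. Each entry looks back at most `max_items` positions
    (the bound B), so the total work is O(n * B^2).
    """
    n = len(widths)
    best = [INF] * (n + 1)
    prev = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_items), end):
            if best[start] == INF:
                continue
            cost = best[start] + line_cost(
                widths[start:end], line_width, is_last=(end == n)
            )
            if cost < best[end]:
                best[end] = cost
                prev[end] = start
    if n > 0 and best[n] == INF:
        raise ValueError("no feasible layout: some item exceeds the line width")
    # Walk the back-pointers to recover the chosen break positions.
    breaks = []
    pos = n
    while pos > 0:
        breaks.append(pos)
        pos = prev[pos]
    return breaks[::-1]

if __name__ == "__main__":
    print(break_lines([5, 1, 1, 5], line_width=6, max_items=4))
```

The hard part for Arabic is hidden entirely inside `line_cost`: a real version would have to shape the candidate line and score the available elongations and ligature choices, which is exactly the piece the footnote admits is not simple.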
> but when rendering you could stretch individual glyphs to the desired width
“Individual glyphs” :)
It’s Arabic, so you wouldn’t stretch a single glyph; it would have to be done after shaping so you can work out the next run (either a single Aleph or the joined characters) in order to know what is stretchable (then throw it to your layout step).
That might be because of translations from Arabic. The article was also posted on a different website, where the author responded:
> the Kashida section was contributed to this post from a talk in Arabic of Nawal Hadeed, which she translated and added to the post herself. Although I'm unsure of LLM usage in the translation process, looking at the original Arabic I felt some change in tone while editing the post. I could have either declined the translation and never have this documented, procrastinate in translating it myself (which has been ongoing for a while), or publish as it is. I found the last least damaging.
Disclaimer: I’m not fluent in Arabic by any means, but the stretched out to both margins style looks very Quranic to me. I don’t think it looks appropriate for say, a message about my DoorDasher.