Unicode is Worth It

27 October, 2023

In computing, one standard has come to define how text is encoded: the Unicode Standard. The vast majority of the text you see on a screen - on a website, in a Word document, or in a file system - uses its UTF-8 encoding.

Being a nearly universal standard, it receives plenty of criticism. In fact, I was being a little misleading with the title, since I am not above criticising it myself. Plenty of Chinese, Korean, and Japanese computing experts have sharply condemned Han unification, and for the most part I agree with them. And as plenty of developers know, Unicode's complex historical context has led to a design that is at times inelegant and inconsistent: Ω is not the same as Ω.1 Sometimes, people even advocate doing away with it entirely and reverting to multiple encodings. For instance, from an old version of the GNU moe manual:

Replacing all the 8-bit character sets with Unicode is like trying to simplify transportation by standardizing on the same kind of (excessively) large vehicle. I.e., forcing everybody to use a vehicle as large as the largest vehicle anybody may need. Just like owning a normal car may be orders of magnitude cheaper than owning a four-engined airliner, text tools using an 8-bit character set may be orders of magnitude more efficient than those using Unicode.

Or comments by the creator of the D language, Walter Bright:

The imposing is inflicted by the Unicode standard on everyone in the world who have no use for never ending new invented encodings for the same thing... What should give pause in advocacy for the tarpit of Unicode is its unimplementability. That's a giant red flag that something went horribly wrong.

This is a position I strenuously disagree with - a universal character set and encodings that use it are good things.

Core to many of these claims is the idea that different languages should use specialised encodings. Unicode, according to these critics, has accreted unfathomable complexity in service of an impossible goal. And indeed Unicode is complex - but replacing it with multiple encodings would just hand most of that complexity off to a different source. Operating systems handle strings at a fundamental level: instead of string facilities in operating systems and programming languages having to handle the minutiae of Unicode, they would have to handle the complexities of many different encodings. Imagine that instead of one kind of string in UTF-16 (Windows) or UTF-8 (most Unixes), there were now around ten, all of which had to interact with core OS services - and each governed by its own standard, instead of the singular body of Unicode standards.

Even then, this assumes that developers would bother to implement all these encodings in the first place. In our current Unicode world, many still don't handle complex writing systems such as Arabic very well - a fact commonly pointed out by Unicode's detractors to condemn its complexity. Let's look at the Arabic text below:

نيوزيلندا

This is legible to your average Arabic speaker. Unfortunately, it isn't when incorrectly rendered as something like:

ن ي و ز ي ل ن​ د ​​​​ا

It's easy to blame Unicode, but developers wouldn't have the impetus to handle different languages at all without it. Unicode at least forces compatibility with some form of text, and badly-rendered text is better than no text at all. Given the dynamics at play, the latter would be inevitable with the hypothetical Unicode-replacement.2 Currently, a person can transmit Arabic that was badly rendered by one system to another one that competently renders it. But if they can't enter their native language in the first place, then - to use a technical term - they are stuffed.

Unicode's rules also provide a guideline for correctly implementing the rendering of these languages: there are always inherent difficulties in rendering scripts that are more complex than letters placed side by side, and it makes more sense to account for that at a foundational level. Since developers have to build at least a partial implementation of Unicode features for many things English speakers use and expect, such as emoji, there's an incentive for them to handle global languages properly. With many different encodings, many simply wouldn't bother.

There's also the issue of mixing different languages. Due to the Western world's preeminent position in the global cultural sphere, a lot of colloquial speech in different languages freely combines various writing systems. Anecdotally, I'm in communities with non-English speakers, and many of them casually mix English and their native language. Even if they don't, many interact with multiple scripts daily - say, speaking Czech at work but playing games in English on Steam. And this mixing occurs in more formal contexts as well, such as official names, English translations, abbreviations, or technical terms. For instance, a good portion of corporate Chinese, Japanese, Cyrillic, or Arabic text is interspersed with English brand names:

ニンテンドー3DSシリーズおよびWii Uの「ニンテンドーeショップ」

Sure, one could switch between encodings for this - but it's a lot of rigmarole for a task that is far more common than the majority of English speakers would expect.

People will also say that multi-byte characters are bad and should be avoided. This is a very tempting claim to make, as a lot of string handling in low-level programming languages would suddenly become a whole lot easier. Before Unicode became popular, the general assumption was that a variant of ASCII would be used, where each character fits neatly in a byte. Much of Unicode's complexity comes from the fact that, to store so many different characters, it has to encode most of them in two or more bytes.3 So avoiding that would greatly simplify things... except, not really. Although this works for English and the ASCII character set, languages such as Chinese and Japanese have heaps of characters - Unicode alone encodes nearly 100,000 CJK ideographs - which is more than can fit in a byte, or even two bytes. Systems that maintain the ideal of one byte per character simply don't work in many places; that might have been easy to ignore in a pre-Unicode era, but in an increasingly globalised world, it's not.
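To make the size constraint concrete, here's a quick Python sketch (the sample characters are my own, not from the article) of how many bytes UTF-8 needs per character as you move beyond ASCII:

```python
# UTF-8 uses 1 to 4 bytes per character, depending on the code point.
for ch in ["A", "é", "中", "𝕏"]:
    print(f"{ch!r}: U+{ord(ch):04X}, {len(ch.encode('utf-8'))} byte(s)")
# 'A' takes 1 byte, 'é' 2, '中' 3, and '𝕏' 4 - so one byte per
# character simply cannot cover the CJK range, let alone everything else.
```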

Despite what some people say, supporting non-Latin scripts isn't a nice-to-have. It's very easy in an English-speaking world to dismiss the importance of other languages. But Mandarin Chinese is the second most popular language in the world, with nearly a billion native speakers. These people deserve not only basic operating system support for naming files, writing documents, and reading web content efficiently, but also support in applications that perform more specialised tasks. Even if you're building a tool for an obscure language or niche hobby, or cutting corners to get your product to market ASAP, you never know who will end up using it.

In the early days of Unicode, multi-byte encoding was also criticised for being inefficient at storing text. Initially, most Unicode software used the UTF-16 encoding, which stores every character in either 16 or 32 bits. But most implementations nowadays use UTF-8, which preserves single-byte encoding for ASCII characters; this also has the advantage of some backwards compatibility with programs that assume a byte is coterminous with a character. Only other scripts, such as the aforementioned Chinese and Japanese, are relegated to a higher number of bytes. Some people still criticise this as prioritising Western languages, but the number of these characters is so great that many would still take up multiple bytes even if they were given first priority. And I believe that segregating these languages into their own second-class encodings with limited support would be far more deleterious than the additional bytes.
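The trade-off is easy to demonstrate in Python: UTF-8 halves the size of ASCII text relative to UTF-16, while katakana goes the other way (three bytes per character in UTF-8, two in UTF-16). The sample strings here are just illustrative:

```python
english = "Hello, world"
japanese = "ニンテンドー"  # six katakana characters

for text in (english, japanese):
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # -le variant omits the byte-order mark
    print(f"{text}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
# "Hello, world": UTF-8 12 bytes, UTF-16 24 bytes
# "ニンテンドー": UTF-8 18 bytes, UTF-16 12 bytes
```

Neither encoding wins everywhere; UTF-8's advantage is that the common machine-readable formats (HTML tags, JSON keys, source code) are ASCII-heavy regardless of the document's human language.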

Another complaint is that Unicode has a whole bunch of useless characters, with various decorative and duplicate symbols given as typical examples.

Due to the vast number of Unicode characters - mostly Chinese, but plenty from other languages as well - any encoding that stores them all must be able to represent code points of up to 21 bits, which in practice means code units or sequences of up to 32 bits. But while the characters cannot comfortably fit into a smaller space, that still leaves heaps of empty room. So most of the extraneous symbols are, to be quite frank, harmless. If you don't like them, you can almost pretend that they don't exist.4 But in fact, they can often be helpful: while people often consider 𝕏, 𝖃, and 𝐗 to be useless decoration and an inappropriate usage of Unicode, they have very distinct uses in mathematical equations.5 If other characters deserve encodings due to their semantic meaning, these do as well, and their codepoints have plenty of practical applications, such as screen readers or contexts without rich text.
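The character database itself records those semantics - a small Python sketch showing the distinct names that a screen reader or plain-text tool can draw on:

```python
import unicodedata

# Each "decorative" variant carries its own machine-readable name:
for ch in ["X", "𝕏", "𝐗"]:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0058  LATIN CAPITAL LETTER X
# U+1D54F MATHEMATICAL DOUBLE-STRUCK CAPITAL X
# U+1D417 MATHEMATICAL BOLD CAPITAL X
```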

What is undeniably a negative is that Unicode has multiple ways of encoding the same symbol. But despite causing issues now, this was initially a very useful feature. For Unicode to gain wide adoption, it needed to be a lossless conversion target; i.e. it needed to have everything that previous encodings had. While the result may not be optimal in the present, I think it goes without saying that the alternative - Unicode plus several other individual encodings living on - would be even worse.

Similarly, the vast number of characters leads some to claim that nobody can possibly implement all of Unicode. But this is beside the point. Nobody could single-handedly lay all of the internet cabling that spans the globe. Nobody could implement a lot of things as a single unit, but our collective effort allows the world to work cohesively regardless. The same is true here: plenty of Unicode libraries exist, and just like other complex aspects of programming, people can build on the shoulders of the metaphorical giants. On the user-facing end, most modern applications support fallback fonts. If users need certain characters, they simply obtain a font that includes them; if they don't, then it's no skin off their back. In many cases, the OS doesn't even need to know about these characters, so new glyphs can be added and used with minimal pain. Unicode may have a lot of complexities, but this in particular is elegant and simple.

It's easy to forget, given Unicode's predominance, that there was once a time before it. We've tried living in a world without Unicode; even then, humans still communicated in a vast array of characters and scripts. The multifarious encodings only made that more difficult - hence Unicode was created in the first place. Unicode is only making sense of the world that already existed, banishing encoding conversion errors to relics of the past and making global communication easier. It is just a reification - an actualisation, in a sense - of the hundreds of years of human history that already exist.

Further reading

Posted in Computing