
Unicode 15 released

source link: https://lwn.net/Articles/908032/

[Posted September 14, 2022 by corbet]
Version 15 of the Unicode standard has been released.
This version adds 4,489 characters, bringing the total to 149,186 characters. These additions include two new scripts, for a total of 161 scripts, along with 20 new emoji characters, and 4,193 CJK (Chinese, Japanese, and Korean) ideographs.

Unicode 15 released

Posted Sep 14, 2022 18:16 UTC (Wed) by SLi (subscriber, #53131) [Link]

Why do we get to use the "bottom left part of glyph is damaged" modifier only for hieroglyphs? :(

Unicode 15 released

Posted Sep 14, 2022 21:03 UTC (Wed) by flussence (subscriber, #85566) [Link]

It makes sense given the percentage of source material that needs it, but I agree. Plenty of stuff written on paper could've used that too!

Unicode 15 released

Posted Sep 15, 2022 9:10 UTC (Thu) by n8willis (subscriber, #43041) [Link]

So the good news is that it's not really forbidden or anything; the Script property in UCD is defined for text-processing purposes and TR24 says implementers need to support "out of scope" usage (https://www.unicode.org/reports/tr24/#Out_of_Scope).

Thus the super-secret path to getting it pushed up into the realm of the commonly accepted would probably be to open an issue on HarfBuzz to try and get it explicitly added to the various shapers, then once that's done, insist that a glyph for it be added to some high-profile FOSS fonts, then finally petition Unicode to add it to the Script Extensions on the grounds that it's so common.

Couple years, tops.

Unicode 15 released

Posted Sep 14, 2022 19:05 UTC (Wed) by alspnost (guest, #2763) [Link]

Finally, a WiFi emoji!

Unicode 15 released

Posted Sep 14, 2022 21:15 UTC (Wed) by Sesse (subscriber, #53779) [Link]

And one for HONK

Unicode 15 released

Posted Sep 15, 2022 5:02 UTC (Thu) by suckfish (guest, #69919) [Link]

What is the practical limit on the number of Unicode code points, and when are we due to run out on the current trajectory? (Is the growth rate linear or faster or slower?)

UTF-8 and 4 byte encodings give a 21 bit space, (but for whatever reason it is defined to max out at code point 0x10ffff rather than 0x1fffff). If we reach that limit, is it feasible to extend or does some combination of security/compatibility/stupidity make extension non-feasible?

(If we followed the UTF-8 scheme systematically you can go to 7 byte encoding and a 36 bit space, and leave the byte value 0xff as a prefix for even further extension. So theoretically possible...)
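The bit counts mentioned here can be checked with a small Python sketch. It assumes the usual UTF-8 pattern is simply continued: an n-byte sequence starts with a lead byte of n one-bits followed by a zero, and every following byte carries 6 payload bits. The function name `payload_bits` is made up for this illustration.

```python
def payload_bits(n_bytes: int) -> int:
    """Payload bits in an n-byte sequence if the UTF-8 pattern is continued."""
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    if not 2 <= n_bytes <= 7:
        raise ValueError("this sketch only covers 1..7 bytes")
    # The lead byte spends n_bytes one-bits plus a terminating zero bit on the
    # length prefix, which leaves 7 - n_bytes payload bits (none for 0xFE).
    lead_bits = 7 - n_bytes
    # Each continuation byte (10xxxxxx) carries 6 payload bits.
    return lead_bits + 6 * (n_bytes - 1)


for n in range(1, 8):
    print(n, "bytes:", payload_bits(n), "bits")
```

Under that assumption a 4-byte sequence holds 21 bits, a 6-byte sequence 31 bits, and a 7-byte sequence (lead byte 0xFE) 36 bits, leaving 0xFF unused.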

Unicode 15 released

Posted Sep 15, 2022 7:36 UTC (Thu) by dh (subscriber, #153) [Link]

Unicode has some 1100000 possible code points. With 150000 assigned and 5000 new ones per year we're safe for another 190 years. So while this might lead to a year-2212-problem, I'd say it's a bit early to take actions.
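The estimate above is simple enough to redo in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a constant number of additions per year, one release per year, and it ignores that surrogates, private-use areas and noncharacters can never be assigned. The helper name `years_left` is hypothetical.

```python
TOTAL_CODE_POINTS = 0x10FFFF + 1   # 17 planes of 65,536 code points each
ASSIGNED_IN_15 = 149_186           # total quoted in the article
ADDED_IN_15 = 4_489                # additions quoted in the article


def years_left(total: int, assigned: int, per_year: int) -> float:
    """Years until the space is exhausted at a constant yearly growth rate."""
    return (total - assigned) / per_year


# The rounded figures used in the comment above.
rounded = years_left(1_100_000, 150_000, 5_000)

# The same estimate using the exact figures from the article.
exact = years_left(TOTAL_CODE_POINTS, ASSIGNED_IN_15, ADDED_IN_15)

print(f"rounded: {rounded:.0f} years, exact: {exact:.0f} years")
```

With the exact figures the result comes out a little above 200 years rather than 190, which does not change the conclusion.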

Unicode 15 released

Posted Sep 15, 2022 14:20 UTC (Thu) by zwol (guest, #126152) [Link]

> I'd say it's a bit early to take actions.

I think it would be better to stop baking the artificial 16-plane limit into every program that processes UTF-8 /now/. The longer we put this change off, the more places will have to be changed and the more painful it will be. It's exactly the same deal as Y2K and Y2038: fix it early or it'll cost more.

Unicode 15 released

Posted Sep 15, 2022 17:40 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

First we would have to persuade Microsoft and Oracle that UTF-16 was a bad idea. For backcompat reasons, that's basically never going to happen (i.e. Windows and Java still use UTF-16 extensively and seemingly have no plans to remove or deprecate it).

Unicode 15 released

Posted Sep 15, 2022 18:20 UTC (Thu) by devslashilly (guest, #124291) [Link]

Good news: since Java 18, UTF-8 has been the default charset (https://openjdk.org/jeps/400). Now we need to wait the 10 years for people to update their JVMs.

Unicode 15 released

Posted Sep 15, 2022 23:10 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

That's nice, but String is still UTF-16 according to https://docs.oracle.com/javase/7/docs/api/java/lang/Strin..., and that's arguably a bigger problem than what charset the OS-level APIs use.

Meanwhile, Windows is never going to give up UTF-16, unless they have a full backcompat break. The problem, you see, looks roughly like this:

0. In the bad old days of everyone using a different ISO-8859 variant (except for East Asia, where they had a wildly different set of encodings because CJK scripts are huge), Windows came in different editions, and the code page was baked into the OS. All of their APIs would transparently use the OS-level code page, and if you wanted to support other code pages, too bad, it was impossible.
1. At some point, they decided that was too ugly, and modularized things to such an extent that you could install multiple locales on the same computer and switch between them. But the old APIs were still around, so the OS-level code page became the application code page, set by default to "whatever the active locale specifies," and the API grew a few functions for changing the code page if desired. Eventually, they also added manifest support so you could just do that at the packaging stage instead of having to write actual code for it.
2. The Unicode Consortium comes along and tells everyone "We're doing this great new encoding, it'll have all the languages and fit into 16 bits." Microsoft decides to go all-in on this, and deploys a brand-new set of APIs which are identical to the old APIs, except they use wchar instead of char, and only accept UTF-16 (native endianness, usually LE). The old char APIs are informally deprecated but continue to exist for backcompat reasons. Also, they introduce a whole bunch of preprocessor macros so that you can just code against the two APIs as if they were one API, not think about charsets at all, and then decide which API to use with a single global #define at build time.
3. Everyone figures out that 16 bits is not enough, and surrogate pairs are born. Microsoft bolts on surrogate pair checking to their existing UTF-16 APIs and calls it a day. (Technically, before surrogate pairs existed, it was called "UCS-2" rather than UTF-16, but I have avoided using that name to prevent confusion. It's exactly the same encoding, other than the existence of surrogate pairs.)
4. Everyone figures out that UTF-8 is far superior to UTF-16. After much hemming and hawing, Microsoft adds a code page for UTF-8, as if it's just another legacy encoding, but eventually they update their documentation to vaguely suggest that maybe using the char API with the UTF-8 code page is better in some circumstances. Also, they make the UTF-8 code page the system-level default in all locales, but that only affects the char API, because the wchar API never used code pages in the first place (it's hard-coded to UTF-16 and always has been).
5. So now, the wchar functions must continue to exist, and must continue to use UTF-16, or else lots of applications will stop working. Microsoft *could* tell everyone to recompile against the char functions with a UTF-8 code page, and then drop wchar support, breaking everyone who didn't recompile, but they are not Apple and can't get away with doing something like that. Also, the wording of their documentation strongly suggests that much of the NT codebase uses UTF-16 internally and that it's the "native" encoding of modern Windows, so changing the APIs would be putting lipstick on a pig anyway.
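To make step 3 concrete, here is a small Python sketch showing how the same string looks as 16-bit code units. It uses Python's built-in `utf-16-le` codec as a stand-in; nothing here calls a Windows API, and the helper name `utf16_code_units` is invented for the example.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of text in UTF-16 (little-endian)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# A character inside the original 16-bit range is a single code unit,
# exactly as it would have been in UCS-2.
print([hex(u) for u in utf16_code_units("A")])

# A character beyond U+FFFF needs two code units: a surrogate pair.
print([hex(u) for u in utf16_code_units("\U0001F600")])
```

Code that treats each 16-bit unit as one character keeps working for the first case and silently miscounts in the second, which is why surrogate handling had to be bolted on afterwards.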

Unicode 15 released

Posted Sep 16, 2022 0:34 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> Meanwhile, Windows is never going to give up UTF-16, unless they have a full backcompat break.

Windows allows you to use the "-A" functions with UTF-8 now. It just internally translates the strings into WTF-16.

It's possible to flip this around to using UTF-8 internally and translating the WTF-16 into UTF-8 on the system library border. This will have an impact on kernel-level drivers, but even there a compat layer can provide a smooth transition.

This is also made easier because of the microkernel-ish design of Windows, where most ioctls/syscalls are actually done via a sort of message passing. So translation can be done in a central location that is fairly straightforward to maintain.

Unicode 15 released

Posted Sep 15, 2022 11:08 UTC (Thu) by grawity (subscriber, #80596) [Link]

> UTF-8 and 4 byte encodings give a 21 bit space, (but for whatever reason it is defined to max out at code point 0x10ffff rather than 0x1fffff)

UTF-8 has a 31-bit space (0x7fffffff). It's artificially capped to 0x10ffff because that's the limit for UTF-16 surrogate-pair encoding.
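The 0x10ffff cap falls straight out of the surrogate-pair arithmetic, as this short Python sketch shows. The constant and function names are made up for the illustration; the surrogate ranges themselves (U+D800..U+DBFF and U+DC00..U+DFFF) are the standard ones.

```python
HIGH_SURROGATES = 0xDC00 - 0xD800   # 1024 possible high (lead) surrogates
LOW_SURROGATES = 0xE000 - 0xDC00    # 1024 possible low (trail) surrogates
BMP_SIZE = 0x10000                  # code points reachable with one code unit

# Every (high, low) combination addresses one code point above the BMP.
MAX_CODE_POINT = BMP_SIZE + HIGH_SURROGATES * LOW_SURROGATES - 1


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not BMP_SIZE <= cp <= MAX_CODE_POINT:
        raise ValueError("not a supplementary code point")
    offset = cp - BMP_SIZE                   # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(MAX_CODE_POINT))
print([hex(u) for u in to_surrogate_pair(MAX_CODE_POINT)])
```

The largest pair, (0xDBFF, 0xDFFF), maps to exactly 0x10FFFF, so UTF-16 has no way to express anything beyond it without a new mechanism.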

Unicode 15 released

Posted Sep 16, 2022 1:21 UTC (Fri) by scientes (subscriber, #83068) [Link]

> UTF-8 has a 31-bit space (0x7fffffff).

Only glibc accepts the version with 5- and 6-byte characters you are suggesting (which is documented in the utf8 man page). Other implementations completely reject the following bytes: c0, c1, (UTF-16 surrogate pairs), f8-ff.
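As one example of a strict implementation, Python's built-in `utf-8` codec can be used to try this out. The sketch below only shows what that one decoder does; it is not a statement about every library. The helper name `is_valid_utf8` is made up for the illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "U+10FFFF, the highest code point": b"\xf4\x8f\xbf\xbf",
    "0xC0 lead byte (overlong form)": b"\xc0\x80",
    "0xC1 lead byte (overlong form)": b"\xc1\xbf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "5-byte sequence with 0xF8 lead": b"\xf8\x88\x80\x80\x80",
    "4-byte sequence above U+10FFFF": b"\xf4\x90\x80\x80",
}

for label, raw in samples.items():
    print(f"{label}: {'accepted' if is_valid_utf8(raw) else 'rejected'}")
```

Only the first sample is accepted; the rest are rejected for the reasons listed in the comment above.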

Unicode 15 released

Posted Sep 15, 2022 19:45 UTC (Thu) by plugwash (subscriber, #29694) [Link]

> UTF-8 and 4 byte encodings give a 21 bit space, (but for whatever reason it is defined to max out at code point 0x10ffff rather than 0x1fffff).

"whatever reason" being UTF-16.

> when are we due to run out on the current trajectory? (Is the growth rate linear or faster or slower?)

Unicode has 17 planes, numbered 0 to 16. Planes 15 and 16 are reserved for private use, while planes 0-14 are mostly used for public allocations.

Unicode maintains a series of roadmaps on their site showing the status of the various planes. Taking a quick look at them:

Plane 0 (the basic multilingual plane) is basically full at this point.
Plane 1 (the supplementary multilingual plane): about half is formally assigned, but much of the rest is "pencilled in" for various scripts.
Plane 2 (the supplementary ideographic plane) is getting pretty full with a series of "CJK unified ideographs extension"s, though there is a little bit of unallocated space (presumably because "CJK unified ideographs extension G" was too big to fit). There don't seem to be any tentative allocations in this plane.
Plane 3 (the tertiary ideographic plane) has "CJK unified ideographs extension G", which takes up a bit over a 16th of the plane, plus tentative allocations for various historic scripts used in China. Even taking the tentative allocations into account, it's less than half full.
Plane 14 (the supplementary special purpose plane) is mostly unused.

Planes 4 through 13 are completely unallocated.

So about two and a half planes worth of code space are formally assigned with tentative allocations taking that up to around 3 and a half. That is less than a quarter of the total encoding space.
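For anyone wanting to check the plane arithmetic, a minimal Python sketch follows. It relies only on the fact that each plane is 0x10000 code points wide, so the plane number is the code point shifted right by 16 bits; the function name `plane_of` is invented for the example, and the 3.5-plane figure is the rough estimate from the comment above, not a measured value.

```python
PLANE_SIZE = 0x10000      # 65,536 code points per plane
PLANE_COUNT = 17          # planes 0 through 16


def plane_of(cp: int) -> int:
    """Return the plane number (0..16) that a code point belongs to."""
    if not 0 <= cp < PLANE_SIZE * PLANE_COUNT:
        raise ValueError("outside the Unicode code space")
    return cp >> 16


print(plane_of(ord("A")))      # basic multilingual plane
print(plane_of(0x1F600))       # an emoji in the supplementary multilingual plane
print(plane_of(0x20000))       # start of the supplementary ideographic plane

# Rough share of the code space that is assigned or pencilled in,
# using the estimate of about three and a half planes.
share = 3.5 / PLANE_COUNT
print(f"{share:.1%} of the code space")
```

That share works out to roughly a fifth, consistent with "less than a quarter".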

> If we reach that limit, is it feasible to extend or does some combination of security/compatibility/stupidity make extension non-feasible?

Basically you would have to:

1. Define how to extend the existing encodings. For UTF-8 and UTF-32 that is trivial, since both encodings are artificially limited, but UTF-16 would be more difficult.
2. Update software that handles encoding conversion and validation to accept the new versions of the encoding.
3. Get that modified software deployed widely.

It's not impossible, but I think it's unlikely that a sufficiently compelling reason will come along to motivate the global software community to do it.

I think it's more likely that if/when encoding space becomes a problem, proposers will be encouraged to encode their scripts in a way that uses less encoding space, i.e. making more use of combining characters, variation selectors and so on.
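As a small illustration of the combining-character approach, the Python sketch below uses the standard `unicodedata` module to show that an accented letter can be expressed as a base letter plus a combining mark instead of occupying its own code point. The choice of "é" is just an example.

```python
import unicodedata

composed = "\u00e9"                                  # é as one precomposed code point
decomposed = unicodedata.normalize("NFD", composed)  # base letter + combining mark

print([f"U+{ord(c):04X}" for c in composed])
print([f"U+{ord(c):04X}" for c in decomposed])
print([unicodedata.name(c) for c in decomposed])

# Both forms are canonically equivalent: normalizing back gives the original.
roundtrip = unicodedata.normalize("NFC", decomposed)
print(roundtrip == composed)
```

A script encoded this way needs one code point per base letter and one per mark, rather than one per combination.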

Unicode 15 released

Posted Sep 15, 2022 21:59 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> "whatever reason" being UTF-16.

We should just deprecate it and start moving away from it. UTF-8 is the best choice anyway.

Even Windows is supporting true UTF-8 APIs these days.

Unicode 15 released

Posted Sep 15, 2022 22:50 UTC (Thu) by khim (subscriber, #9252) [Link]

> Even Windows is supporting true UTF-8 APIs these days.

Externally but not internally.

I guess the first step would be to encourage programs to move to UTF-8.

Because that's the most important step, anyway.

