Where Did CP852 Come From?

source link: http://www.os2museum.com/wp/where-did-cp852-come-from/

In the 1990s, a lot of my documents were written in code page 852 (CP852), also known as PC Latin 2. This code page is sometimes called “Eastern European”, which is a bit misleading, given that it does not cover major Eastern European countries like Ukraine; sometimes it is also called “Slavic”, which is no less misleading because it covers languages like Hungarian or Albanian that aren’t remotely Slavic.

In those days, fighting with code pages was a constant source of annoyance and pain. DOS and OS/2 used CP852, Windows used CP1250, and Unix/Linux used ISO 8859-2. Of course these code pages were all incompatible with each other. The worst problem was the early web, where content was often offered in some 8-bit encoding but with no hint as to which encoding that might have been (let’s play a guessing game!). It is a real shame that UTF-8 didn’t arrive a bit earlier.
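To make the incompatibility concrete, here is a minimal Python sketch that writes a short Czech phrase as CP852 bytes and then reads those bytes back under each of the three encodings. The sample phrase and the `decode_all` helper are illustrative choices, not something from the period; the codec names are the ones in Python's standard library.

```python
# The same bytes, interpreted under the three Latin 2 encodings named above.
ENCODINGS = ["cp852", "cp1250", "iso8859_2"]

# Sample Czech text (an illustrative choice), written with escapes:
# "Zlutoucky kun" with its diacritics.
text = "\u017dlu\u0165ou\u010dk\u00fd k\u016f\u0148"


def decode_all(raw):
    """Decode raw bytes under every candidate encoding.

    Bytes that a codec cannot map are replaced rather than raising.
    """
    return {enc: raw.decode(enc, errors="replace") for enc in ENCODINGS}


raw = text.encode("cp852")  # what a DOS or OS/2 editor would have saved
results = decode_all(raw)

for enc, decoded in results.items():
    # repr() keeps control characters visible instead of printing them raw
    print(f"{enc:10} {decoded!r}")
```

Only the CP852 reading gives the original text back; the other two produce plausible-looking garbage, which is exactly what a reader of an unlabeled 8-bit page had to sort out by eye.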

In the early to mid-1990s the situation was further complicated by several non-standard encodings, like the Kamenický brothers encoding in Czechoslovakia or the Mazovia encoding in Poland. Those encodings originated in the mid-1980s and tended to preserve most of the CP437 semi-graphic characters. Code page 852 did not; on the other hand, it covered quite a few languages. Users initially preferred the non-standard national encodings because those worked better for them, but built-in operating system support pushed those out.
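The loss of semi-graphic characters can be checked with Python's built-in `cp437` and `cp852` codecs. The sketch below is only an approximation of "semi-graphic": it counts characters from the Unicode Box Drawing block (U+2500 to U+257F) and ignores the shade and block characters. The helper names are made up for this example.

```python
def high_half(codec):
    """Return the characters a single-byte codec assigns to bytes 0x80-0xFF."""
    chars = set()
    for b in range(0x80, 0x100):
        try:
            chars.add(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte has no assignment in this codec
    return chars


def is_box_drawing(ch):
    """True for characters in the Unicode Box Drawing block."""
    return 0x2500 <= ord(ch) <= 0x257F


cp437_box = {c for c in high_half("cp437") if is_box_drawing(c)}
cp852_box = {c for c in high_half("cp852") if is_box_drawing(c)}

# Line-drawing characters present in CP437 but gone from CP852
dropped = sorted(cp437_box - cp852_box)

print(len(cp437_box), "box-drawing characters in CP437")
print(len(cp852_box), "box-drawing characters in CP852")
print("dropped:", "".join(dropped))
```

The dropped characters are the mixed single/double line joints, whose slots CP852 reuses for accented letters. That is why text-mode frames drawn with those joints turned into stray letters under CP852.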

And now I started wondering: When did CP852 become available to users, and where did it actually come from? The first question can be answered reasonably accurately, while the second remains unclear.

The first major product that offered CP852 support was IBM DOS/MS-DOS 5.0 in 1991. This was followed by OS/2 2.0 in 1992. There is no indication that any earlier DOS version supported CP852, although in 1992, Microsoft sold a shrink-wrapped product called AlphabetPlus which allowed installing CP852 and the associated national support on top of MS-DOS 3.3 or 4.0.

As it turns out, thanks to the surviving betas of MS-DOS 5.0 it’s possible to define the timeline of CP852 support in DOS quite accurately. In the MS-DOS 5.0 beta from December 1990 there is no sign of CP852 or any associated country support.

The next beta from January 1991 is most interesting. It offers several previously unsupported countries in its setup:

New country support in Jan ’91 MS-DOS 5.0 beta

Czechoslovakia, Hungary, Poland, and Yugoslavia all needed CP852 and weren’t supported before. The setup offers all those countries but… it does not work! Perhaps the support was not yet complete, or there was a packaging error, but the EGA.CPI file (screen fonts) shipped with the January 1991 MS-DOS beta is actually the same as in the previous MS-DOS 5.0 beta and does not include CP852. So the language support can be configured, but enabling it fails. This clearly indicates that January 1991 was when Microsoft was in the middle of adding the required support.

The next MS-DOS beta from March 1991 includes the expanded EGA.CPI file and supports CP852 properly:

Functioning national language support in March 1991 DOS 5.0 beta

However, the oldest available OS build with functioning CP852 support known to date is… OS/2 2.0 pre-release build 6.123 from February 1991. This build contains the updated character definitions, and CP852 support actually works. Unfortunately no older betas of OS/2 2.0 are available for comparison, and it is thus unclear whether 6.123 was the first build with CP852 support or not. Given that MS-DOS had only just started getting that support in January 1991, there is a reasonable chance that OS/2 6.123 was indeed the first such build.

OS/2 6.123 supports CP852

This leaves an open question: Did any earlier IBM products include CP852 support? Perhaps DB2? Or some communication product? It is not impossible.

Backporting CP852

An obvious question occurred to me: Can the extended national language support (NLS) in DOS 5.0 be grafted on top of an earlier DOS version? It appears that it can:

CP852 working in IBM DOS 4.01

All it takes is copying over COUNTRY.SYS, KEYBOARD.SYS, and EGA.CPI. That allows DOS 4.0 to support the new countries, keyboards, and code pages added in DOS 5.0, while using the original DISPLAY.SYS, NLSFUNC.EXE, KEYB.COM, and MODE.COM shipped with DOS 4.0.

Note that DOS 4.0 uses the same NLS infrastructure that IBM introduced in DOS 3.3. Correctly configuring DOS NLS is somewhat mind-bending, but that’s a topic for another time.

CP852 Origins

So… where did code page 852 come from, exactly? The obvious place to look would be IBM. According to IBM’s globalization database, CP852 was registered in… 1993. That is not exactly credible, given that DOS 5.0 provably supported CP852 in 1991. Even more confusingly, RFC 1345 says that their source for CP852 is IBM’s NLS reference manual from March 1990. Who do you trust, IBM or people citing IBM?

My attempts to locate an old copy of IBM’s “Green Book” have proven unsuccessful. It is thus very unclear when IBM first published CP852.

An interesting wrinkle is IBM’s definition of character set (not code page!) 982, Latin 2 PC. That character set, which is supposedly primarily contained in CP852, was defined in 1986! Did IBM really sit on it for several years before putting together a code page with that character set? That sounds quite unlikely.

Note that 1986 would be roughly in line with ISO 8859-2 or ECMA-94; the second edition of ECMA-94 defined “Latin Alphabet No. 2” (equivalent of ISO 8859-2) in June 1986. It’s quite likely that IBM was involved in the standardization process.

A well-informed source tells me that the CP852 character bitmaps (that end up in EGA.CPI) as well as Czech, Hungarian, Polish, Slovak, and Yugoslav keyboard layouts for DOS were defined by IBM in 1987. Yes, 1987, the year DOS 3.3 was released, and four years before those keyboard layouts and CP852 would be supported in DOS 5.0.

In the late 1980s, the countries that would typically use CP852 were all behind the Iron Curtain and many of them embargoed. IBM PCs were not commonly available there. So why would IBM care?

There turns out to be some interesting and obscure history. IBM has a lab in Budapest which opened in 1936 and has never closed since. IBM also had offices in Prague, Sofia, and Warsaw. Vienna was the headquarters of IBM ROECE, variously given as “Regional Office Europe Central and East” or “Regional Office for Eastern and Central Europe”. IBM ROECE’s territory was essentially the Eastern Bloc minus the Soviet Union: Bulgaria, Czechoslovakia, East Germany, Hungary, Poland, Romania, Yugoslavia.

According to an article published in the German ComputerWoche in February 1976, IBM ROECE was indeed selling systems to Eastern Bloc countries. Note that the article mentions “over 300” systems installed, but those were mid-range and mainframe systems, not PCs (obviously). IBM also reportedly had trouble selling larger System/370 machines due to embargoes. Yugoslavia was the biggest buyer because it was not subject to the CoCom embargo.

This was never talked about much by either side. Communist governments didn’t want to talk about buying from evil capitalist Americans because their own technology was inferior, and IBM perhaps didn’t want to boast too much about selling to evil communist Eastern Bloc countries either.

But it does explain why IBM worked on national language support for those countries well before the Berlin Wall fell and the borders opened.

P.S: What’s a Caron, Anyway?

A minor related mystery is that of the caron. Several languages (mostly Slavic) use a diacritical mark which likely originated in Czech and is called “háček” (it’s that funny little thing above the ‘c’), which means “little hook”. In current international standards, the diacritical mark is called “caron”. The problem is that no one understands why, and the whole thing might be a misunderstanding. There are some guesses, but the origin of the term “caron” is completely unclear; there is evidence that it was supposed to designate an inverted caret (^) and not a diacritical mark. It is fascinating to see how mistakes can spread.
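The standardized name is easy to see for yourself. As a small illustration, Python's `unicodedata` module reports the official Unicode character names, and the word “háček” appears in none of them:

```python
import unicodedata

# Precomposed letter, the spacing mark, and the combining mark
samples = ["\u010d", "\u02c7", "\u030c"]

names = {ch: unicodedata.name(ch) for ch in samples}
for ch, name in names.items():
    print(f"U+{ord(ch):04X}  {name}")

# Canonical decomposition splits the letter into 'c' plus the combining mark
decomposed = unicodedata.normalize("NFD", "\u010d")
print([f"U+{ord(c):04X}" for c in decomposed])
```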

