
Unicode String Operations


Normalization

Let's bring back our old friend from previous posts, the character "é". As we've seen, this character has two possible code point compositions:

  1. The single code point 0xE9, and
  2. The code points 0x65 + 0x301 ("e" plus the acute accent).

Why have different code point combinations that produce the same character? Isn't this just needless complexity, wasting memory and CPU cycles? The reason is to facilitate conversions to and from pre-existing encodings. Due to coding space limitations, many older encodings represented base characters with modifiers and combining marks (such as the acute accent in "é") as combinations of codes. If you're converting such an encoding to Unicode, a direct one-to-one mapping of each of these codes to code points is faster and easier to implement (think: avoiding lookahead buffers).
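
To see the difference at the byte level, here's a quick check you can run with zig test (the hex bytes are the standard UTF-8 encodings of these code points):

const std = @import("std");

test "two byte sequences, one character" {
    // The single code point U+00E9 encodes to two UTF-8 bytes.
    try std.testing.expectEqualSlices(u8, "\xC3\xA9", "\u{E9}");
    // U+0065 plus the combining U+0301 encodes to three UTF-8 bytes.
    try std.testing.expectEqualSlices(u8, "\x65\xCC\x81", "\u{65}\u{301}");
}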

But now we have a problem. Let's say we have these two strings:

const std = @import("std");

const str_1 = "Jos\u{E9}";
const str_2 = "Jos\u{65}\u{301}";

try std.testing.expectEqualStrings(str_1, str_2); // BOOM!

The test doesn't pass because we're comparing the bytes that compose each string, and they're obviously not the same. To drive home the significance of this problem, let's look at the actual error produced:

====== expected this output: =========
José␃

======== instead found this: =========
José␃

======================================
First difference occurs on line 1:
expected:
José
   ^
found:
José
   ^

WAT 🦆? Here lies the real problem: these two strings look exactly the same when rendered visually as glyphs. As programmers, we know the bytes and the code points differ, but normal humans (😹) perceive these strings as equal, so they should be treated as equivalent.

Unicode Character Equivalence

To deal with this problem, Unicode defines the relationships between characters and their code point compositions. The standard defines two types of equivalence: Canonical and Compatibility. The two forms of "é" above are canonically equivalent; a compatibility example is the ligature "ﬁ" (U+FB01) versus the two letters "fi". We won't go into the gory details here, but if you're interested in learning more, UAX #15 has all the info you'll ever need.

For the purpose of most day-to-day string comparison needs, all we need to know is that Unicode has precise rules regarding how strings can be compared correctly, taking into account the different composition and decomposition forms that can be present. These rules and accompanying algorithms fall under the umbrella topic called Normalization, and Ziglyph has a Normalizer struct that handles all the details for you to let you compare strings easily.

const std = @import("std");
const Normalizer = @import("ziglyph").Normalizer;
const testing = std.testing;

var allocator = std.testing.allocator;

// Relative path to the Decompositions.bin file.
const decomp_file = "./Decompositions.bin";
var norm = try Normalizer.init(allocator, decomp_file);
defer norm.deinit();

const str_1 = "Jos\u{E9}";
const str_2 = "Jos\u{65}\u{301}";
const str_3 = "JOSÉ";

// Normalized, case-sensitive comparison.
try testing.expect(try norm.eqlBy(str_1, str_2, .normalize));

// Normalized, case-insensitive comparison.
try testing.expect(try norm.eqlBy(str_2, str_3, .norm_ignore));

The init function takes an allocator and the relative path (relative to where you execute your final binary) to the Decompositions.bin file. This file is a compressed binary form of the Unicode Character Database's code point decomposition data; see the side note for more details. You can find a copy of it in the src/data/ucd subdirectory of the GitHub repo. This data ships as an external file instead of being embedded directly into the source code purely because of the original size of the decompositions data. Now that compression has reduced this size substantially, this design may change in a future version. Let me know what you think!
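
If your program may be launched from any directory, one way to make the lookup independent of the working directory is to resolve the path relative to the executable itself. A minimal sketch (this is not part of Ziglyph; allocator is any std.mem.Allocator, as in the snippets above):

const std = @import("std");

// Resolve Decompositions.bin next to the executable rather than
// relative to the current working directory.
var path_buf: [std.fs.MAX_PATH_BYTES]u8 = undefined;
const exe_dir = try std.fs.selfExeDirPath(&path_buf);
const decomp_path = try std.fs.path.join(
    allocator,
    &[_][]const u8{ exe_dir, "Decompositions.bin" },
);
defer allocator.free(decomp_path);
var norm = try Normalizer.init(allocator, decomp_path);
defer norm.deinit();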

The Normalizer.eqlBy method takes the two strings to be compared and a comparison mode from the CmpMode enum, which includes .normalize for case-sensitive normalized comparisons and .norm_ignore for the case-insensitive mode.

Ziglyph's Normalizer has support for all Unicode Normalization Forms and methods to compose and decompose code points and strings among them. Check out the src/normalizer/Normalizer.zig source code to see what's available.
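
For example, decomposing both encodings of "é" to NFD should produce identical bytes. Here's a rough sketch of that idea; the method name nfdString is my assumption, so check Normalizer.zig for the actual API:

// Hypothetical method name (nfdString); consult
// src/normalizer/Normalizer.zig for the real one.
const nfd_1 = try norm.nfdString(allocator, "Jos\u{E9}");
defer allocator.free(nfd_1);
const nfd_2 = try norm.nfdString(allocator, "Jos\u{65}\u{301}");
defer allocator.free(nfd_2);
try testing.expectEqualStrings(nfd_1, nfd_2);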

Unicode Data Differential Compression

(Side note: UDDC is the compression scheme Ziglyph applies to the raw Unicode data files, such as Decompositions.bin above and the collation weights table below, to shrink them substantially.)

Sorting Strings

Given the complexities we've seen so far with compositions, decompositions, normalization forms, and character equivalence, you can imagine that sorting Unicode strings isn't a trivial task either. The Unicode Collation Algorithm (UCA) defines the process by which Unicode strings can be compared to determine their order. The algorithm incorporates normalization as a first step to level the playing field. Once the strings are normalized, a table of weights is used to build a sort key for each string from the weights of its constituent code points, and comparing two strings becomes a matter of comparing their sort keys, level by level.
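
To make the weight idea concrete, here's a toy sketch of level-by-level comparison. This is not Ziglyph's actual implementation; real collation elements are variable-length and include ignorable weights, and this toy assumes one element per character and equal-length inputs:

const std = @import("std");

// One collation element per character, three weight levels.
const Weights = struct { primary: u16, secondary: u16, tertiary: u16 };

// Exhaust the primary level across the whole string before
// consulting secondary, and secondary before tertiary.
fn toyOrder(a: []const Weights, b: []const Weights) std.math.Order {
    var i: usize = 0;
    while (i < a.len) : (i += 1) {
        if (a[i].primary != b[i].primary)
            return std.math.order(a[i].primary, b[i].primary);
    }
    i = 0;
    while (i < a.len) : (i += 1) {
        if (a[i].secondary != b[i].secondary)
            return std.math.order(a[i].secondary, b[i].secondary);
    }
    i = 0;
    while (i < a.len) : (i += 1) {
        if (a[i].tertiary != b[i].tertiary)
            return std.math.order(a[i].tertiary, b[i].tertiary);
    }
    return .eq;
}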

The default weights table provided by Unicode is pretty huge, with the original file coming in at about 1.8MiB. The same UDDC algorithm is applied to this data (see the side note above), bringing the size down substantially to a modest 99KiB! But enough about the technical details; let's finally see how to sort strings with Ziglyph:

const std = @import("std");
const ziglyph = @import("ziglyph");
const Collator = ziglyph.Collator;
const Normalizer = ziglyph.Normalizer;
const testing = std.testing;

var allocator = std.testing.allocator;

// Collation depends on Normalization.
const decomp_file = "./Decompositions.bin";
var norm = try Normalizer.init(allocator, decomp_file);
defer norm.deinit();

// Relative path to the allkeys.bin weights file.
const weights_file = "./allkeys.bin";
var collator = try Collator.init(
    allocator,
    weights_file,
    &norm,
);
defer collator.deinit();

// Collation weight levels overview:
//     .primary:   different base characters.
//     .secondary: same base characters but
//                 marks (like accents) differ.
//     .tertiary:  same base characters and marks but
//                 case differs.
// So cab < dab at .primary, cab < cáb at .secondary,
// and cáb < Cáb at .tertiary level.

// These methods return true if the string arguments are
// in the specified order.
try testing.expect(collator.tertiaryAsc("abc", "def"));
try testing.expect(collator.tertiaryDesc("def", "abc"));

// At the .primary level, "José" and "jose" are equal because
// the base characters are the same; only marks and case differ,
// and those are weighted at the .secondary and .tertiary levels
// respectively. `.eq` is a `std.math.Order`.
try testing.expect(try collator.orderFn(
    "José",
    "jose",
    .primary,
    .eq,
));

// Full Unicode in-place sort.
var strings: [3][]const u8 = .{ "xyz", "def", "abc" };
collator.sortAsc(&strings);
try testing.expectEqualStrings("abc", strings[0]);
try testing.expectEqualStrings("def", strings[1]);
try testing.expectEqualStrings("xyz", strings[2]);

// ASCII-only sort. If you know the strings are ASCII only,
// this method is faster.
strings = .{ "xyz", "def", "abc" };
collator.sortAsciiAsc(&strings);
try testing.expectEqualStrings("abc", strings[0]);
try testing.expectEqualStrings("def", strings[1]);
try testing.expectEqualStrings("xyz", strings[2]);

Once again we have to pass a relative path to Collator.init, this time pointing to the weights table file allkeys.bin. This was also designed this way due to the size of the data file, but in theory it has a side benefit: Unicode allows tailoring of the UCA by providing your own weights table. The jury's still out on this design choice too, so feedback is welcome.

When comparing strings using the Unicode weights, there are three levels providing increasing granularity in the comparison (see the sketch after this list).

  1. .primary takes into consideration only the base characters in a string, ignoring case and combining marks, so "cáb" and "Cab" are equal at this level but not "cab" and "dab".
  2. .secondary adds combining marks to the .primary comparison, so "Cab" and "cab" are equal at this level but not "Cab" and "Cáb".
  3. .tertiary adds case to the comparison, so "cáb" and "cáb" are equal but not "Cáb" and "cáb"; pretty much an exact match is required here.
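
Continuing the collator example above, you can probe these levels directly with orderFn. The expected results below follow the rules just listed; treat them as a sketch to verify against your own build:

// .secondary: case is still ignored, but marks now matter.
try testing.expect(try collator.orderFn("Cab", "cab", .secondary, .eq));
// Unaccented sorts before accented at the .secondary level.
try testing.expect(try collator.orderFn("cab", "cáb", .secondary, .lt));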

Be sure to check out the source code of src/collator/Collator.zig for more examples.

80 Columns, 80 Characters... Not So Fast Buddy!

To top off this post, we have the issue of character display widths. It just so happens that some characters fit nicely into a single column (or cell) when rendered as a glyph in a fixed-width font, but other characters need special consideration. Most control characters and combining marks have width 0. Some even have a negative width (-1), as is the case with Backspace and DEL. Yet other characters have a variable width between 1 and 2 (called half- and full-width) depending on context. There's even the 3-em dash (⸻), which has a width of 3. Ziglyph has the Width namespace with functions to calculate the widths of code points and strings.

const std = @import("std");
const dw = @import("ziglyph").Width;
const testing = std.testing;

// The width methods take a second parameter of value 
// .half or .full to determine the width of "ambiguous"
// code points as per the Unicode standard. .half is the
// most common case.

// Note that codePointWidth returns an i3 because code 
// points like Backspace have width -1.
try testing.expectEqual(dw.codePointWidth('é', .half), 1);
try testing.expectEqual(dw.codePointWidth('😊', .half), 2);
try testing.expectEqual(dw.codePointWidth('统', .half), 2);

var allocator = std.testing.allocator;

// strWidth returns usize because it can never be negative, 
// regardless of the code points contained in the string.
try testing.expectEqual(try dw.strWidth(allocator, "Hello\r\n", .half), 5);
try testing.expectEqual(try dw.strWidth(allocator, "Héllo 🇵🇷", .half), 8);

Equipped with a means to calculate proper display width for Unicode strings, we can have some fun with them now:

// padLeft, center, padRight
const right_aligned = try dw.padLeft(allocator, "w😊w", 10, "-");
defer allocator.free(right_aligned);
try testing.expectEqualStrings("------w😊w", right_aligned);

const centered = try dw.center(allocator, "w😊w", 10, "-");
defer allocator.free(centered);
try testing.expectEqualStrings("---w😊w---", centered);

const left_aligned = try dw.padRight(allocator, "w😊w", 10, "-");
defer allocator.free(left_aligned);
try testing.expectEqualStrings("w😊w------", left_aligned);

// Line wrap.
const input = "The quick brown fox\r\njumped over the lazy dog!";
const got = try dw.wrap(
    allocator,
    input,
    10, // desired line width
    3, // wrap threshold
);
defer allocator.free(got);
const want = "The quick\n brown \nfox jumped\n over the\n lazy dog\n!";
try testing.expectEqualStrings(want, got);

Note that for line wrapping, the wrap function takes the desired number of columns per line and a threshold value that determines how close the last character of the last word can be to the wrapping point before the function decides whether to wrap there. You can play with this value to avoid orphaned punctuation or small words on a last line. This function uses the WordIterator we saw in the previous post to avoid wrapping within a word. Perhaps an interesting future project would be hyphenation?
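
For example, to experiment with the threshold, re-run wrap with a different value and inspect the result (printed here rather than asserted, since the exact break points are an implementation detail):

// Same input as above, looser threshold; see how the breaks move.
const loose = try dw.wrap(allocator, input, 10, 1);
defer allocator.free(loose);
std.debug.print("{s}\n", .{loose});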

That's All Folks!

Well, this post has been pretty long and a bit dense, but I hope the Ziglyph functionality presented here proves useful in any project requiring Unicode text processing in Zig. Although quite functional, the library is still a work in progress, so feedback, recommendations, issues, and pull requests are welcome. This is the final post in this series, but a post on Zigstr is planned for the future, so until then, have fun fellow Ziguanas 🦎!

