Split grapheme in C#

Background

First, I would like to thank my friend netero, who helped me a great deal in completing this code.

While processing strings, we found that we could not reliably obtain the length of a string as a reader would count it. We searched widely and found plenty of related code, but the results were poor: most of it was based on outdated versions of the Unicode data files, so many characters were handled incorrectly. We therefore decided to implement this function ourselves.

The project is here: https://github.com/DebugST/STGraphemeSplitter

Cases

string strText = "abc";
Console.WriteLine(strText.Length); // output is: 3

// But when the string contains special characters, such as emoji...

strText = "👩‍🦰👩‍👩‍👦‍👦🏳️‍🌈";
Console.WriteLine(strText.Length); // output is: 22

The desired result is 3, but the actual result is 22. Why is that?

Character clusters

Character clusters refer to text elements that people intuitively and cognitively consider to be individual characters. A character cluster may be an abstract character, or it may be composed of multiple abstract characters. Character clusters should be the basic unit of text operations.

The reason is that in many languages and runtimes, including .NET, a string is stored in memory as a sequence of UTF-16 code units, and the length that is reported is the number of those code units, not the number of characters. Each code unit (a char) is two bytes. Even if every value in that range were assigned to a character, two bytes cover only 0x0000-0xFFFF, which is 65,536 values. That is not enough to hold even all Chinese characters.

Coding Range

So the Unicode organization came up with a solution: surrogates. It does not treat every value in 0x0000-0xFFFF as a character.

Instead, it set aside a range of 2,048 values as surrogates.

0xD800-0xDBFF are high surrogates. 0xDC00-0xDFFF are low surrogates.

In well-formed text, a high surrogate is always followed by a low surrogate. The low 10 bits of each are combined into a 20-bit value, and 0x10000 is added to produce the code point. This makes room for 1,048,576 additional code points.

Such a code point therefore takes two UTF-16 code units, that is, two char values.
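To make the difference between code units and code points concrete, here is a minimal sketch that counts code points by treating each surrogate pair as one. [CountCodePoints] is a hypothetical helper written for this article, not part of STGraphemeSplitter.

```csharp
using System;

static class CodeUnitDemo {
    // Counts Unicode code points: a high surrogate followed by a low surrogate counts once.
    // An unpaired surrogate is counted as one code point on its own.
    public static int CountCodePoints(string text) {
        int count = 0;
        for (int i = 0; i < text.Length; i++) {
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1])) {
                i++; // skip the low surrogate, it belongs to the same code point
            }
            count++;
        }
        return count;
    }
}
```

For "\uD83D\uDC69" (one emoji), [Length] is 2 but the sketch returns 1. For "a\u0304" it still returns 2, so counting code points alone does not give the number of character clusters.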

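// Returns the code point that starts at nIndex, combining a surrogate pair if present.
private static int GetCodePoint(string strText, int nIndex) {
    // Not a high surrogate: the code unit itself is the code point.
    if (!char.IsHighSurrogate(strText, nIndex)) {
        return strText[nIndex];
    }
    // A high surrogate at the very end of the string has no partner.
    if (nIndex + 1 >= strText.Length) {
        return 0;
    }
    // Take the low 10 bits of each surrogate and add 0x10000.
    return ((strText[nIndex] & 0x03FF) << 10) + (strText[nIndex + 1] & 0x03FF) + 0x10000;
}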
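As a worked check of this arithmetic, the sketch below applies the same formula to a single pair. [SurrogateDemo.Combine] is a hypothetical name used only here; [char.ConvertToUtf32] is the framework method that performs the same conversion.

```csharp
using System;

static class SurrogateDemo {
    // Combines a surrogate pair into a code point: low 10 bits of each, plus 0x10000.
    public static int Combine(char high, char low) {
        return ((high & 0x03FF) << 10) + (low & 0x03FF) + 0x10000;
    }
}
```

For the pair 0xD83D, 0xDC69: 0xD83D & 0x03FF is 0x03D, which shifted left by 10 bits is 0xF400. 0xDC69 & 0x03FF is 0x069. 0xF400 + 0x069 + 0x10000 is 0x1F469, the same value that char.ConvertToUtf32('\uD83D', '\uDC69') returns.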

As mentioned above, a [high surrogate] is followed by a [low surrogate]. So is a character at most two code units, that is, four bytes? No, that is not how it works. Code points in different ranges have different properties, and Unicode determines character clusters from these properties.

Take one of the most common cases as an example: [\r\n]

Most programming languages treat it as two characters, and it is indeed two characters.

But to a human reader, whether it is [\r\n] or [\n], it is always one character: [new line]

So [\r\n] is perceived as one character, not two.

If it is not treated as one unit, the following happens:

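string strA = "A\r\nB";
// Reversing code unit by code unit swaps CR and LF (Reverse() requires System.Linq).
string strB = new string(strA.Reverse().ToArray()); // "B\n\rA"

This is not the result we want. The result we want is "B\r\nA", and Unicode defines exactly this in rule [GB3]: https://www.unicode.org/reports/tr29/#GB3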

Do not break between a CR and LF. Otherwise, break before and after controls.
GB3                     CR   ×   LF
GB4    (Control | CR | LF)   ÷      
GB5                          ÷   (Control | CR | LF)
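A minimal sketch of a reversal that respects only rule GB3 is shown here. [ReverseKeepingCrLf] is a hypothetical helper for illustration; it ignores every other rule, so it is not a substitute for a real grapheme splitter.

```csharp
using System;
using System.Text;

static class CrLfDemo {
    // Reverses a string code unit by code unit, but keeps each "\r\n" pair intact (GB3 only).
    public static string ReverseKeepingCrLf(string text) {
        var sb = new StringBuilder(text.Length);
        for (int i = text.Length - 1; i >= 0; i--) {
            // Walking backwards: an LF directly preceded by a CR is emitted as one unit.
            if (text[i] == '\n' && i > 0 && text[i - 1] == '\r') {
                sb.Append("\r\n");
                i--; // the CR has been consumed together with the LF
            } else {
                sb.Append(text[i]);
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, reversing "A\r\nB" gives "B\r\nA" instead of "B\n\rA".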

Characters can also be combined with one another, for example: [ā]

It looks like a character, but it is actually a combination of two characters. [a + ̄ = ā] -> "a\u0304"

This is how the 0x0300-0x036F interval is defined in Unicode:

0300..036F    ; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X

So "\u0304" has the [Extend] property, and the segmentation rules treat [Extend] as follows:

Do not break before extending characters or ZWJ.
GB9                          ×    (Extend | ZWJ)
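The effect of GB9 can be sketched as below. This is a rough approximation under two loud assumptions: it uses the general categories NonSpacingMark and EnclosingMark plus U+200D as a stand-in for [Extend] and [ZWJ] (the real ranges come from GraphemeBreakProperty.txt and cover more than this), and it only handles text without surrogate pairs. [ExtendDemo] is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class ExtendDemo {
    // Approximation of [Extend] | [ZWJ]; not the exact Unicode property.
    static bool IsExtendOrZwj(char c) {
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
        return c == '\u200D'
            || cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    // Breaks between every pair of code units except before an Extend/ZWJ character (GB9 only).
    public static List<string> Split(string text) {
        var result = new List<string>();
        int start = 0;
        for (int i = 1; i <= text.Length; i++) {
            if (i == text.Length || !IsExtendOrZwj(text[i])) {
                result.Add(text.Substring(start, i - start));
                start = i;
            }
        }
        return result;
    }
}
```

Splitting "a\u0304b" with this sketch yields two elements, "a\u0304" and "b", because no break is placed before "\u0304".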

Unicode defines many properties. The ones used to determine segmentation are:

CR, LF, Control, L, V, LV, LVT, T, Extend, ZWJ, SpacingMark, Prepend, Extended_Pictographic, RI

The code point ranges for each of these properties are also defined by Unicode:

https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakProperty.txt

The rules that determine whether adjacent characters belong to the same cluster are here:

https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules
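As an illustration of how such rules turn into code, the sketch below decides whether a break is allowed between two adjacent code points from their properties alone. It covers only GB3, GB4, GB5, GB9, GB9a, GB9b and the default rule GB999. The Hangul rules (GB6-GB8), emoji ZWJ sequences (GB11) and regional indicators (GB12, GB13) are left out because some of them need more context than a single pair. [BreakProperty] and [BreakRules] are hypothetical names, not the types used in STGraphemeSplitter.

```csharp
using System;

enum BreakProperty { Other, CR, LF, Control, Extend, ZWJ, SpacingMark, Prepend }

static class BreakRules {
    static bool IsControlLike(BreakProperty p) {
        return p == BreakProperty.Control || p == BreakProperty.CR || p == BreakProperty.LF;
    }

    // Returns true if a cluster boundary is allowed between prev and next.
    public static bool IsBreak(BreakProperty prev, BreakProperty next) {
        if (prev == BreakProperty.CR && next == BreakProperty.LF) return false;        // GB3
        if (IsControlLike(prev)) return true;                                          // GB4
        if (IsControlLike(next)) return true;                                          // GB5
        if (next == BreakProperty.Extend || next == BreakProperty.ZWJ) return false;   // GB9
        if (next == BreakProperty.SpacingMark) return false;                           // GB9a
        if (prev == BreakProperty.Prepend) return false;                               // GB9b
        return true;                                                                   // GB999
    }
}
```

The order of the checks matters: GB4 is tested before GB9, so a control character followed by an [Extend] character still produces a break.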

The code follows the Unicode standard that was current when it was written (14.0.0). For future Unicode updates, the project also provides a code generation function that can regenerate the lookup code from the latest data files. For example:

/// <summary>
/// Builds the [GetGraphemeBreakProperty] function and [m_lst_code_range].
/// The current [GetGraphemeBreakProperty] and [m_lst_code_range] were created from:
/// https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
/// https://www.unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
/// The [Extended_Pictographic] property is not in [GraphemeBreakProperty.txt(14.0.0)],
/// so append [emoji-data.txt] to [GraphemeBreakProperty.txt] before generating the code.
/// </summary>
/// <param name="strText">The text of [GraphemeBreakProperty.txt]</param>
/// <returns>Code</returns>
public static string CreateBreakPropertyCodeFromText(string strText);
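The generator's internals are not shown in this article. As a hedged sketch of the first step such a generator would need, the code below parses one data line of GraphemeBreakProperty.txt (or emoji-data.txt) into a range and a property name. [BreakPropertyLineParser] is a hypothetical helper and is not claimed to match how [CreateBreakPropertyCodeFromText] works.

```csharp
using System;
using System.Globalization;

static class BreakPropertyLineParser {
    // Parses a data line such as "0300..036F    ; Extend # Mn [112] ...".
    // Returns false for comment lines, blank lines, and anything that is not "range ; property".
    public static bool TryParse(string line, out int start, out int end, out string property) {
        start = end = 0;
        property = "";
        int hash = line.IndexOf('#');
        if (hash >= 0) line = line.Substring(0, hash);   // drop the trailing comment
        string[] parts = line.Split(';');
        if (parts.Length != 2) return false;
        string range = parts[0].Trim();
        property = parts[1].Trim();
        // A range is either a single code point ("000D") or "first..last".
        int dots = range.IndexOf("..", StringComparison.Ordinal);
        string first = dots < 0 ? range : range.Substring(0, dots);
        string last = dots < 0 ? range : range.Substring(dots + 2);
        return int.TryParse(first, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out start)
            && int.TryParse(last, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out end);
    }
}
```

For the line quoted earlier, "0300..036F    ; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X", this gives start 0x0300, end 0x036F and the property name "Extend".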

The splitter is used as follows:

string strText = "👩‍🦰👩‍👩‍👦‍👦🏳️‍🌈Abc";
List<string> lst = STGraphemeSplitter.Split(strText);
Console.WriteLine(string.Join(",", lst.ToArray())); // Output: 👩‍🦰,👩‍👩‍👦‍👦,🏳️‍🌈,A,b,c

int nLen = STGraphemeSplitter.GetLength(strText);   // Only get the length.

foreach (var v in STGraphemeSplitter.GetEnumerator(strText)) {
    Console.WriteLine(v);
}

// Faster. The lambda parameter is named nCount so it does not clash with the local nLen above.
STGraphemeSplitter.Each(strText, (str, nStart, nCount) => {
    Console.WriteLine(str.Substring(nStart, nCount));
});

// If that is still not fast enough, create a cache before use.
// The array cache is relatively fast but takes up a lot of space.
STGraphemeSplitter.CreateArrayCache();
// The dictionary cache is relatively slow but takes up little space.
STGraphemeSplitter.CreateDictionaryCache();
STGraphemeSplitter.ClearCache();                // Clear all caches.
