Idiosyncratic Ruby: Ruby has Character
source link: https://idiosyncratic-ruby.com/66-ruby-has-character.html
Ruby comes with good support for Unicode-related features. Read on if you want to learn more about important Unicode fundamentals and how to use them in Ruby…
…or just watch my talk from RubyConf 2017:
⑩ Unicode Characters You Should Know About as a 👩💻
Ruby ♡ Unicode
Characters in Unicode
Unicode has come a long way and is now available in version 13.0 (core specification). The standard defines a lot of things related to characters, however, it is not always easy to grasp what a character actually is. Is DŽ a single character or not? What about non-Latin languages?
We will need some more fine-grained concepts to distinguish and talk about characters in Unicode:
- Codepoint: A base unit to construct characters from. Often this maps directly to a single character. Depending on the encoding, a codepoint might require multiple bytes.
- Grapheme cluster: Smallest linguistic unit, a user-perceived character, constructed out of one or multiple codepoints.
- Glyph: The actual rendered shape which represents the grapheme cluster
Codepoints & Encodings
A codepoint is the base unit of Unicode: a number mapped to some meaning. Often this resolves to a single character:
"\u{41}" # => "A"
"\u{ABCD}" # => "ꯍ"
"\u{1F6A1}" # => "🚡"
There are 1114112 (in hexadecimal: 0x110000) different codepoints. On byte-level, a codepoint can be represented in different ways, which depends on the encoding used. Popular encodings for Unicode are UTF-8, UTF-16, and UTF-32, which all have different mechanisms of representing codepoints:
| Codepoint | Decimal | Glyph | Bytes UTF-8 | Bytes UTF-16LE | Bytes UTF-32LE |
|---|---|---|---|---|---|
| U+0041 | 65 | A | 41 | 41 00 | 41 00 00 00 |
| U+ABCD | 43981 | ꯍ | EA AF 8D | CD AB | CD AB 00 00 |
| U+1F6A1 | 128673 | 🚡 | F0 9F 9A A1 | 3D D8 A1 DE | A1 F6 01 00 |
Here is an overview, without going into too much detail:
- UTF-8 uses a dynamic number of bytes: While ASCII characters fit into a single byte, it can use up to 4 bytes for higher codepoints.
- UTF-16 uses 2 bytes, if possible, but has a 4 byte mechanism to represent higher codepoints.
- UTF-32 is a direct representation of the codepoint and always uses 4 bytes; no logic is involved. It is also a little lavish, because even the largest codepoint U+10FFFF only uses 21 bits of information. As a consequence, the last byte (in little-endian order) is always 00.
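The byte sequences in the table above can be reproduced with core String methods. This is a minimal sketch; `hex_bytes` is a hypothetical helper name, not something Ruby ships with:

```ruby
# Transcode a (UTF-8) string to the given encoding and show its bytes as hex.
# hex_bytes is a hypothetical helper, not part of Ruby.
def hex_bytes(string, encoding)
  string.encode(encoding).bytes.map { |byte| "%02X" % byte }.join(" ")
end

hex_bytes("\u{1F6A1}", "UTF-8")    # => "F0 9F 9A A1"
hex_bytes("\u{1F6A1}", "UTF-16LE") # => "3D D8 A1 DE"
hex_bytes("\u{1F6A1}", "UTF-32LE") # => "A1 F6 01 00"
```

The same helper shows how the byte count grows in UTF-8 (1 byte for "A", 3 for U+ABCD, 4 for U+1F6A1) while UTF-32 stays at 4 bytes throughout.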
You can visualize and learn about encodings on the command-line with the unibits CLI utility.
The rest of this blog post will not deal with encodings and byte representations, but use codepoints as the smallest unit.
Grapheme Clusters
A user-perceived character might be constructed out of multiple codepoints. There are a lot of combining characters (like diacritics) which get combined with the previous character to form a new one:
"Ä" = U+0041 "A" + U+0308 "◌̈"
An example from the Thai language:
"กำ" = U+0E01 "ก" + U+0E33 " ำ"
Emoji are another example of grapheme clusters that require multiple codepoints:
"👨🏻🍳"¹ = U+1F468 "👨" + U+1F3FB "🏻" + U+0200D "" + U+1F373 "🍳"
Ruby 2.5 introduced a convenient way to iterate through all grapheme clusters:
"abกำcd".grapheme_clusters # => ["a", "b", "กำ", "c", "d"]
There is also /\X/², a regex feature that you can use instead of the default /./ to match grapheme clusters instead of codepoints:
"abกำcd".scan(/./) # => ["a", "b", "ก", "ำ", "c", "d"]
"abกำcd".scan(/\X/) # => ["a", "b", "กำ", "c", "d"]
¹ Depending on the recentness of your rendering software, this is displayed as a single male cook
² This regex matcher was already introduced in earlier versions of Ruby
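One place where the difference between codepoints and grapheme clusters shows up is reversing a string. The following sketch assumes Ruby 2.5 or newer; `reverse_graphemes` is a hypothetical helper name:

```ruby
word = "A\u{308}b" # "Äb", written with a combining diaeresis

word.length                   # => 3 (codepoints)
word.grapheme_clusters.length # => 2 (user-perceived characters)

# Reversing codepoint by codepoint moves the diaeresis onto the "b"
word.reverse                  # => "b\u0308A"

# Reversing cluster by cluster keeps every character intact.
# reverse_graphemes is a hypothetical helper, not part of Ruby.
def reverse_graphemes(string)
  string.grapheme_clusters.reverse.join
end

reverse_graphemes(word)       # => "bA\u0308"
```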
Normalization
![keyboard.jpg](https://idiosyncratic-ruby.com/images/keyboard.jpg)
Sometimes, the Unicode standard defines multiple ways to describe the same (or a very similar) glyph. Let us revisit the example from above: the German letter "Ä", which is a "A" with two dots above. It is defined as codepoint U+00C4. At the same time, there is a mechanism to put two dots above just any letter using the combining codepoint U+0308. Combine it with "A" and you get "Ä" - A different representation, although semantically, it is the same character.
However, sometimes you need one canonical representation of a string. This is why the Unicode consortium came up with a normalization algorithm. It is included in Ruby's standard library and required automatically. There are several types of normalization forms:
| Form | Description |
|---|---|
| NFC | Default. The C stands for composed: it uses the composed format for graphemes (if available). |
| NFD | The D stands for decomposed: it uses separate codepoints for such graphemes. |
| NFKC | Like NFC, but uses compatibility mode instead of canonical mode. |
| NFKD | Like NFD, but uses compatibility mode instead of canonical mode. |
NFC
"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+00C4"]
"Ä".unicode_normalize.codepoints.map{|c| "U+%04X"%c }
# => ["U+00C4"]
"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+0308"]
"Ä".unicode_normalize.codepoints.map{ |c| "U+%04X"%c }
# => ["U+00C4"]
"A²".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+00B2"]
"A²".unicode_normalize.codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+00B2"]
NFD
"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+00C4"]
"Ä".unicode_normalize(:nfd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0308"]
"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+0308"]
"Ä".unicode_normalize(:nfd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0308"]
"A²".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+00B2"]
"A²".unicode_normalize(:nfd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+00B2"]
NFKC
"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+00C4"]
"Ä".unicode_normalize(:nfkc).codepoints.map{ |c| "U+%04X"%c }
# => ["U+00C4"]
"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+0308"]
"Ä".unicode_normalize(:nfkc).codepoints.map{ |c| "U+%04X"%c }
# => ["U+00C4"]
"A²".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+00B2"]
"A²".unicode_normalize(:nfkc).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0032"]
NFKD
"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+00C4"]
"Ä".unicode_normalize(:nfkd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0308"]
"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+0308"]
"Ä".unicode_normalize(:nfkd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0308"]
"A²".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+00B2"]
"A²".unicode_normalize(:nfkd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0032"]
See the standard and documentation for more details, including the differences between the normalization forms.
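A typical reason to normalize is comparing strings that may arrive in different representations. The following is a sketch under the assumption that both strings are UTF-8; `same_text?` is a hypothetical helper name:

```ruby
composed   = "\u{C4}"   # Ä as a single codepoint
decomposed = "A\u{308}" # A followed by COMBINING DIAERESIS

composed == decomposed                                     # => false
composed.unicode_normalize == decomposed.unicode_normalize # => true

# Compare two strings after bringing both into the same normalization form.
# same_text? is a hypothetical helper, not part of Ruby.
def same_text?(a, b, form = :nfc)
  a.unicode_normalize(form) == b.unicode_normalize(form)
end

same_text?(composed, decomposed)   # => true
same_text?("A\u{B2}", "A2")        # => false, canonical forms keep the superscript
same_text?("A\u{B2}", "A2", :nfkc) # => true, compatibility forms map it to "2"

# Check whether a string already is in a given form (NFC by default)
decomposed.unicode_normalized?     # => false
```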
Special Case: Visual Confusable Characters
Even in normalization form, there are characters which look very similar (sometimes even identical):
| Codepoints A | String A | String B | Codepoints B |
|---|---|---|---|
| U+003F + U+003F | ?? | ⁇ | U+2047 |
| U+0043 | C | С | U+0421 |
| U+0031 | 1 | l | U+006C |
The record holder is LATIN SMALL LETTER O, which is currently linked to 75 other characters that it could be confused with.
Detecting confusable characters is not built in, but it is possible with the unicode-confusable gem:
require "unicode/confusable"
Unicode::Confusable.confusable? "ℜ𝘂ᖯʏ", "Ruby" # => true
Case-Mapping
Another Unicode topic is converting a word from lowercase to uppercase or vice versa. Up until Ruby 2.3, string methods like #upcase, #capitalize, #downcase, or #swapcase would just not work with non-ASCII characters:
"ä".upcase # => "ä" # Ruby 2.3
This has been fixed and more recent versions of Ruby are able to do this out of the box:
"ä".upcase # => "Ä"
The old, ASCII-only behavior can be achieved by passing the :ascii option:
"ä".upcase(:ascii) # => "ä"
This is already much better than before, however, keep in mind that case-mapping is a locale-dependent operation! Not all languages use the same rules for converting between lower- and uppercase. For example, in most languages, the uppercase version of letter i is I:
"i".upcase # => "I"
However, in Turkic languages, it's the letter İ:
"i".upcase(:turkic) # => "İ"
Although Ruby supports locale-specific case mapping rules, as of Ruby 2.5.1, only :turkic is supported. More options might be supported in the future.
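To round this off, here is a short sketch of how the other case methods and options behave with non-ASCII input. It assumes Ruby 2.4 or newer; the example words are arbitrary:

```ruby
"ärger".capitalize     # => "Ärger"
"ÄRGER".downcase       # => "ärger"
"Ärger".swapcase       # => "äRGER"

# Full case mapping can change the length of a string
"straße".upcase        # => "STRASSE"

# :ascii leaves everything outside of a-z / A-Z untouched
"ärger".upcase(:ascii) # => "äRGER"

# :turkic also works in the other direction
"I".downcase(:turkic)  # => "ı" (U+0131, dotless i)
"İ".downcase(:turkic)  # => "i"
```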
Special Case: Case-Folding
There is another special option that you can pass to the String#downcase method: the :fold symbol. It will turn on case-folding, which should be used instead of the default case-mapping behavior if you are interested in comparing/ordering strings. The case-folding algorithm might produce a different output than the case-mapping one. For example, the German letter sharp s should be treated like two normal s letters in comparisons:
"ẞ".downcase # => "ß"
"ẞ".downcase(:fold) # => "ss"
There is another String method in Ruby core which makes use of case-folding: String#casecmp?¹ which compares two strings ignoring their case:
"A".casecmp? "a" # => true
"ẞ".casecmp? "ss" # => true
¹ You should pay attention that its sister method String#casecmp only uses ASCII, despite the similar naming.
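The difference between the two sister methods, and a way to build a comparison key from case-folding, can be sketched as follows. `fold_key` is a hypothetical helper name; combining folding with normalization is an assumption about what a comparison usually needs, not something the methods do on their own:

```ruby
# casecmp only folds ASCII letters, casecmp? uses Unicode case-folding
"ä".casecmp("Ä").zero? # => false
"ä".casecmp?("Ä")      # => true

# Build a key for case-insensitive comparison: normalize first, then fold.
# fold_key is a hypothetical helper, not part of Ruby.
def fold_key(string)
  string.unicode_normalize(:nfc).downcase(:fold)
end

fold_key("Straße")                        # => "strasse"
fold_key("Straße") == fold_key("STRASSE") # => true
```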
Regex Unicode Property Matching
Ruby's regex engine supports matching of Unicode characteristics, like a character's general purpose (general category), its script, or in which codepoint range it is defined (block):
"String, with: punctuation.".scan(/\p{P}/) # => [",", ":", "."]
See my previous articles for more details:
- Episode 41: Proper Unicoding - More about the Unicode property syntax
- Episode 30: Regex with Class - Unicode behavior of regex matchers & POSIX-style character classes
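A few more property matches, as a small sketch covering general categories and scripts. The sample text is arbitrary; the Cyrillic word is written with escapes so that the codepoints are unambiguous:

```ruby
# "Ruby Рубин 123!" with the Cyrillic letters spelled out as escapes
text = "Ruby \u{420}\u{443}\u{431}\u{438}\u{43D} 123!"

# General category: uppercase letters and decimal digits
text.scan(/\p{Lu}/)        # => ["R", "Р"]
text.scan(/\p{Nd}/)        # => ["1", "2", "3"]

# Script: runs of Latin and Cyrillic letters
text.scan(/\p{Latin}+/)    # => ["Ruby"]
text.scan(/\p{Cyrillic}+/) # => ["Рубин"]
```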
Special Case: Emoji Matching
Detecting emoji is especially complicated, because there are multiple mechanisms to build up the final emoji glyph. You can use the unicode-emoji gem to find all kinds of emoji:
require "unicode/emoji"
"😴 🛌🏽 🇵🇹 🤾🏽♀️".scan(Unicode::Emoji::REGEX) # => ["😴", "🛌🏽", "🇵🇹", "🤾🏽♀️"]
Monospace Display-Width
Sometimes, you might find yourself in a situation where you would like to know the width of a character. But this is not easily possible, because the character width is just not defined! This, of course, leads to problems in fixed-width environments like terminals.
If you don't believe me, here are some wide characters for you to check out:

| Codepoint | Glyph | Name |
|---|---|---|
| U+1242B | 𒐫 | CUNEIFORM NUMERIC SIGN NINE SHAR2 |
| U+12219 | 𒈙 | CUNEIFORM SIGN LUGAL OPPOSING LUGAL |
| U+A9C4 | ꧄ | JAVANESE PADA MADYA |
| U+2E3B | ⸻ | THREE-EM DASH |
| U+2031 | ‱ | PER TEN THOUSAND SIGN |

To complicate things further, some Asian characters are marked as ambiguous and get displayed wide or narrow, depending on the software displaying them. The unicode-display_width gem can help:
require "unicode/display_width"
Unicode::DisplayWidth.of("⚀") # => 1
Unicode::DisplayWidth.of("一") # => 2
# Ambiguous example
Unicode::DisplayWidth.of("·", 1) # => 1
Unicode::DisplayWidth.of("·", 2) # => 2
Unicode Special Codepoints
The last section will put the focus on four types of codepoints that require some attention. This is just a selection, there are many more notable codepoints and a good starting point to dig deeper is the Awesome Codepoints list!
Invalid Codepoints
There are two kinds of codepoints which are invalid. If you have these in your data, the data is invalid and String#valid_encoding? will return false. Both of them are encoding-related:
UTF-16 Surrogates
The four byte mechanism that UTF-16 uses to represent codepoints higher than U+FFFF (= 65 535) needs auxiliary codepoints. These are U+D800..U+DFFF and they are strictly forbidden in UTF-8 and UTF-32.
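The surrogate arithmetic itself is short: subtract 0x10000 from the codepoint, then split the remaining 20 bits into two halves of 10 bits each. A minimal sketch, with `utf16_surrogates` as a hypothetical helper name:

```ruby
# Compute the UTF-16 surrogate pair for a codepoint above U+FFFF.
# utf16_surrogates is a hypothetical helper, not part of Ruby.
def utf16_surrogates(codepoint)
  unless (0x10000..0x10FFFF).cover?(codepoint)
    raise ArgumentError, "codepoint does not need surrogates"
  end

  offset = codepoint - 0x10000       # 20 bits remain
  high   = 0xD800 + (offset >> 10)   # upper 10 bits
  low    = 0xDC00 + (offset & 0x3FF) # lower 10 bits
  [high, low]
end

utf16_surrogates(0x1F6A1).map { |unit| "%04X" % unit } # => ["D83D", "DEA1"]

# Ruby's own transcoder produces the same two 16-bit units
"\u{1F6A1}".encode("UTF-16BE").unpack("n*") == utf16_surrogates(0x1F6A1) # => true
```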
Too Large Codepoints
Any codepoint above U+10FFFF (= 1 114 111) is not allowed. The theoretical UTF-32 maximum is U+FFFFFFFF (= 4 294 967 295) and four byte UTF-8 could represent codepoints up to U+1FFFFF (= 2 097 151).
Ruby does not let you create these from literals:
"\u{D800}" # => SyntaxError: (irb):52: invalid Unicode codepoint
"\u{110000}" # => SyntaxError: (irb):54: invalid Unicode codepoint (too large)
But, if you really need to…, you can use Array#pack:
[0xD800].pack("U") # => "\xED\xA0\x80"
[0x110000].pack("U") # => "\xF4\x90\x80\x80"
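Strings built this way carry the UTF-8 encoding label, but they are flagged as invalid, as described at the beginning of this section. A small sketch:

```ruby
surrogate = [0xD800].pack("U")   # a UTF-16 surrogate, forced into UTF-8 bytes
too_large = [0x110000].pack("U") # one above the largest codepoint

surrogate.encoding           # => #<Encoding:UTF-8>
surrogate.valid_encoding?    # => false
too_large.valid_encoding?    # => false

# The largest codepoint itself is fine
"\u{10FFFF}".valid_encoding? # => true

# One invalid codepoint makes the whole string invalid
"a#{surrogate}b".valid_encoding? # => false
```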
Ruby also includes String#scrub, a useful method that replaces or removes all invalid bytes, for example, surrogates:
"a\xED\xA0\x80b" # => "a\xED\xA0\x80b"
"a\xED\xA0\x80b".scrub # => "a���b"
"a\xED\xA0\x80b".scrub("") # => "ab"
Unstandardized Codepoints
Another group of codepoints that require extra care are the unstandardized ones. When you look at the following diagram, you will see that a lot of codepoints actually do not have a meaning assigned by the consortium (yet):
Codepoint Distribution as of Unicode 10
- Reserved: 837 775
- Private Use: 137 468
- Standardized: 138 869
Types of Unstandardized Codepoints
- Private-Use Codepoints: Meant for custom allocations by anyone. You will find vendor logos here, for example, U+F8FF for the Apple logo "" and U+F200 for the ubuntu logo "". Both may only display correctly on the respective operating system (with a proper font). Other uses of the private plane include assigning codepoints to fantasy languages like Tengwar by J.R.R. Tolkien.
- Non-Characters: A handful of codepoints that will never be assigned. Unlike invalid codepoints, they are allowed to be used in your data. But they have no meaning.
- Reserved Codepoints: Will (or might) be assigned in a later version of Unicode
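| Type | Count | Codepoints | Ruby Regex |
|---|---|---|---|
| Private-Use | 137 468¹ | … | /\p{private use}/ |
| Non-Characters | 66 | U+FDD0..U+FDEF and the last two codepoints of each plane: U+XFFFE, U+XFFFF | /\p{nchar}/ |
| Reserved | 837 775 | (not yet assigned) | /\p{unassigned}(?<!\p{nchar})/ |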
¹ Two additional private-use codepoints are U+0091 and U+0092, but they are counted as control characters (see next section)
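The three regexes can be combined into a small classifier. This is only a sketch: `codepoint_status` is a hypothetical helper name, and which codepoints count as reserved depends on the Unicode version your Ruby was built with (U+0378 is used here on the assumption that it is still unassigned):

```ruby
# Classify a single character by its standardization status.
# codepoint_status is a hypothetical helper, not part of Ruby.
def codepoint_status(char)
  case char
  when /\p{private use}/ then :private_use
  when /\p{nchar}/       then :non_character # checked before unassigned
  when /\p{unassigned}/  then :reserved
  else                        :assigned
  end
end

codepoint_status("\u{F8FF}") # => :private_use
codepoint_status("\u{FDD0}") # => :non_character
codepoint_status("\u{0378}") # => :reserved
codepoint_status("A")        # => :assigned
```

Checking for non-characters before unassigned codepoints has the same effect as the negative lookbehind in the table above.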
Control Characters
For historical reasons, Unicode includes a set of 65 control codepoints. They were not defined by the Unicode Consortium and a lot of them are not universally standardized. However, some of them are extremely common, such as U+0009, the tab-stop character. The set also contains the newline characters U+000A "\n" and U+000D "\r"; depending on your operating system, use one or both of them for a newline.
Control characters are divided into the two sections C0, covering U+0000..U+001F, and C1, covering U+0080..U+009F. Furthermore, the delete character U+007F ␡ is also considered to be a control character.
In regexes, you can match for control characters with \p{control} or just \p{cc}.
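The numbers above can be checked with a few lines of Ruby. The following sketch also includes `strip_controls`, a hypothetical helper that removes control characters but keeps tabs and line feeds (which ones to keep is an arbitrary choice for this example):

```ruby
# The common newline characters and their codepoints
"\n".ord # => 10 (U+000A)
"\r".ord # => 13 (U+000D)

# Collect every control codepoint in the C0/C1 area: 32 + 1 + 32
controls = (0x00..0x9F).select { |cp| [cp].pack("U").match?(/\p{cc}/) }
controls.size # => 65

# Remove control characters, except tab and line feed.
# strip_controls is a hypothetical helper, not part of Ruby.
def strip_controls(string)
  string.gsub(/(?![\t\n])\p{cc}/, "")
end

strip_controls("a\u{0}b\tc\u{85}\n") # => "ab\tc\n"
```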
List of C0 Control Codepoints
| Codepoint | Symbol | Ruby Escape | Name |
|---|---|---|---|
| U+0000 | ␀ NUL | \0 | NULL |
| U+0001 | ␁ SOH | \u{1} | START OF HEADING |
| U+0002 | ␂ STX | \u{2} | START OF TEXT |
| U+0003 | ␃ ETX | \u{3} | END OF TEXT |
| U+0004 | ␄ EOT | \u{4} | END OF TRANSMISSION |
| U+0005 | ␅ ENQ | \u{5} | ENQUIRY |
| U+0006 | ␆ ACK | \u{6} | ACKNOWLEDGE |
| U+0007 | ␇ BEL | \a | ALERT |
| U+0008 | ␈ BS | \b | BACKSPACE |
| U+0009 | ␉ HT | \t | CHARACTER TABULATION |
| U+000A | ␊ LF | \n | LINE FEED |
| U+000B | ␋ VT | \v | LINE TABULATION |
| U+000C | ␌ FF | \f | FORM FEED |
| U+000D | ␍ CR | \r | CARRIAGE RETURN |
| U+000E | ␎ SO | \u{e} | SHIFT OUT |
| U+000F | ␏ SI | \u{f} | SHIFT IN |
| U+0010 | ␐ DLE | \u{10} | DATA LINK ESCAPE |
| U+0011 | ␑ DC1 | \u{11} | DEVICE CONTROL ONE |
| U+0012 | ␒ DC2 | \u{12} | DEVICE CONTROL TWO |
| U+0013 | ␓ DC3 | \u{13} | DEVICE CONTROL THREE |
| U+0014 | ␔ DC4 | \u{14} | DEVICE CONTROL FOUR |
| U+0015 | ␕ NAK | \u{15} | NEGATIVE ACKNOWLEDGE |
| U+0016 | ␖ SYN | \u{16} | SYNCHRONOUS IDLE |
| U+0017 | ␗ ETB | \u{17} | END OF TRANSMISSION BLOCK |
| U+0018 | ␘ CAN | \u{18} | CANCEL |
| U+0019 | ␙ EM | \u{19} | END OF MEDIUM |
| U+001A | ␚ SUB | \u{1a} | SUBSTITUTE |
| U+001B | ␛ ESC | \e | ESCAPE |
| U+001C | ␜ FS | \u{1c} | INFORMATION SEPARATOR FOUR |
| U+001D | ␝ GS | \u{1d} | INFORMATION SEPARATOR THREE |
| U+001E | ␞ RS | \u{1e} | INFORMATION SEPARATOR TWO |
| U+001F | ␟ US | \u{1f} | INFORMATION SEPARATOR ONE |
List of C1 Control Codepoints
| Codepoint | Symbol | Ruby Escape | Name |
|---|---|---|---|
| U+0080 | PAD | \u{80} | PADDING CHARACTER |
| U+0081 | HOP | \u{81} | HIGH OCTET PRESET |
| U+0082 | BPH | \u{82} | BREAK PERMITTED HERE |
| U+0083 | NBH | \u{83} | NO BREAK HERE |
| U+0084 | IND | \u{84} | INDEX |
| U+0085 | NEL¹ | \u{85} | NEXT LINE¹ |
| U+0086 | SSA | \u{86} | START OF SELECTED AREA |
| U+0087 | ESA | \u{87} | END OF SELECTED AREA |
| U+0088 | HTS | \u{88} | CHARACTER TABULATION SET |
| U+0089 | HTJ | \u{89} | CHARACTER TABULATION WITH JUSTIFICATION |
| U+008A | VTS | \u{8a} | LINE TABULATION SET |
| U+008B | PLD | \u{8b} | PARTIAL LINE FORWARD |
| U+008C | PLU | \u{8c} | PARTIAL LINE BACKWARD |
| U+008D | RI | \u{8d} | REVERSE LINE FEED |
| U+008E | SS2 | \u{8e} | SINGLE SHIFT TWO |
| U+008F | SS3 | \u{8f} | SINGLE SHIFT THREE |
| U+0090 | DCS | \u{90} | DEVICE CONTROL STRING |
| U+0091 | PU1 | \u{91} | PRIVATE USE ONE |
| U+0092 | PU2 | \u{92} | PRIVATE USE TWO |
| U+0093 | STS | \u{93} | SET TRANSMIT STATE |
| U+0094 | CCH | \u{94} | CANCEL CHARACTER |
| U+0095 | MW | \u{95} | MESSAGE WAITING |
| U+0096 | SPA | \u{96} | START OF GUARDED AREA |
| U+0097 | EPA | \u{97} | END OF GUARDED AREA |
| U+0098 | SOS | \u{98} | START OF STRING |
| U+0099 | SGC | \u{99} | SINGLE GRAPHIC CHARACTER INTRODUCER |
| U+009A | SCI | \u{9a} | SINGLE CHARACTER INTRODUCER |
| U+009B | CSI | \u{9b} | CONTROL SEQUENCE INTRODUCER |
| U+009C | ST | \u{9c} | STRING TERMINATOR |
| U+009D | OSC | \u{9d} | OPERATING SYSTEM COMMAND |
| U+009E | PM | \u{9e} | PRIVACY MESSAGE |
| U+009F | APC | \u{9f} | APPLICATION PROGRAM COMMAND |
¹ The NEXT LINE control character was introduced to have a universal codepoint for newlines. This goal was not reached. Still, on some systems (for example, my Ubuntu machine), it will actually create a newline!
The characteristics gem lets you check if a codepoint belongs to a specific control group:
require "characteristics"
Characteristics.create("\u{80}").c0? # => false
Characteristics.create("\u{80}").c1? # => true
Ignorable Codepoints
My last example of special codepoints are the so-called ignorable codepoints. Their meaning varies, but most of them are invisible and they are often not treated as whitespace by Unicode. They are ignorable in the sense that if your Unicode rendering engine does not know how to display them, it should just display nothing. The ignorable property is even given to some ranges of unassigned codepoints¹ (which is usually not done).
You can check for ignorable codepoints using the /\p{default ignorable code point}/ (or its shorthand \p{di}) regex.
For example, the following piece of code is made out of tag characters, which resemble all ASCII characters, but as ignorable characters:
eval "".codepoints.map{ |c| c - 0xE0000 }.pack("U*")
This program will output Idiosyncratic Unicode
¹ The whole range of U+E0000..U+E0FFF is ignorable!
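The tag character trick works in both directions, because every tag character sits exactly 0xE0000 above its ASCII counterpart. A minimal sketch, where `TAG_OFFSET`, `to_tags`, and `from_tags` are hypothetical names chosen for this example:

```ruby
# Distance between an ASCII character and its tag character twin
TAG_OFFSET = 0xE0000

# Hide printable ASCII text in (usually invisible) tag characters.
# to_tags and from_tags are hypothetical helpers, not part of Ruby.
def to_tags(ascii)
  ascii.codepoints.map { |c| c + TAG_OFFSET }.pack("U*")
end

# Turn tag characters back into readable ASCII
def from_tags(tags)
  tags.codepoints.map { |c| c - TAG_OFFSET }.pack("U*")
end

hidden = to_tags("puts 42")

hidden.match?(/\A\p{di}+\z/) # => true, all codepoints are ignorable
hidden.length                # => 7
from_tags(hidden)            # => "puts 42"
```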
CLI Tools for Codepoint Analysis
I hope that you are now ready to closely inspect your own Unicode data! To help you do so, I made a few command-line tools that I hope you will like:
- uniscribe for codepoint analysis
- unibits for encoding analysis, also supports a lot of non-Unicode encodings
- unicopy for converting & copying codepoints