Unicode Is Awesome
source link: https://wisdom.engineering/awesome-unicode/
A curated list of delightful Unicode tidbits, packages and resources.
Foreword
Unicode is Awesome! Prior to Unicode, international communication was grueling: everyone had defined their own extended character set (called a code page) in the upper half of the 8-bit range above ASCII, and these sets conflicted with one another. Just think of German speakers coordinating with Korean speakers over which 128-character code page to use. Thankfully the Unicode standard caught on and unified communication. Unicode 8.0 standardizes over 120,000 characters from 129 scripts, some modern, some ancient, and some still undeciphered. Unicode handles left-to-right and right-to-left text, combining marks, and includes diverse cultural, political, and religious characters as well as emoji. Unicode is awesomely human, and ultimately underappreciated.
Contents
- Quick Unicode Background
- What Characters Does the Unicode Standard Include?
- Unicode Character Encodings
- UTF-16 Surrogate Pairs
- Calculating Surrogate Pairs
- Composing & Decomposing
- Applied Unicode Encodings
- Awesome Characters List
- Variable identifiers can effectively include whitespace!
- Uppercase Transformation Collisions
- Lowercase Transformation Collisions
- Quirks and Troubleshooting
- One-To-Many Case Mappings
- Awesome Packages & Libraries
- Creatively Naming Variables and Methods
- Recursive HTML Tag Renaming Script
- Exploring Deeper into Unicode Yourself
- A map of the Basic Multilingual Plane
- Principles of the Unicode Standard
Quick Unicode Background
What Characters Does the Unicode Standard Include?
The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.
The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. It provides codes for diacritics, which are modifying character marks such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 9.0 provides codes for 128,172 characters from the world's alphabets, ideograph sets, and symbol collections.
The majority of common-use characters fit into the first 64K code points, an area of the codespace that is called the basic multilingual plane, or BMP for short. There are sixteen other supplementary planes available for encoding other characters, with currently over 850,000 unused code points. More characters are under consideration for addition to future versions of the standard.
The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.
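The private use counts quoted above can be checked with a little arithmetic. The following is a minimal sketch; the ranges are the standard private use ranges, while `PRIVATE_USE_RANGES` and `isPrivateUse` are hypothetical names chosen for illustration.

```javascript
// Private use ranges (inclusive bounds).
const PRIVATE_USE_RANGES = [
  [0xE000, 0xF8FF],     // BMP Private Use Area
  [0xF0000, 0xFFFFD],   // Supplementary Private Use Area-A
  [0x100000, 0x10FFFD], // Supplementary Private Use Area-B
];

// Hypothetical helper (not a built-in): is this code point reserved for private use?
function isPrivateUse(codePoint) {
  return PRIVATE_USE_RANGES.some(([lo, hi]) => codePoint >= lo && codePoint <= hi);
}

// Size of each range, in code points.
const sizes = PRIVATE_USE_RANGES.map(([lo, hi]) => hi - lo + 1);

console.log(sizes[0]);             // 6400
console.log(sizes[1] + sizes[2]);  // 131068
console.log(isPrivateUse(0xE000)); // true
console.log(isPrivateUse(0x41));   // false
```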
Unicode Character Encodings
Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits.
The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32-bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32 is useful where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.
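To make the trade-off concrete, here is a small sketch that counts the code units each encoding form needs for a few characters. It assumes an environment with the standard `TextEncoder` global (browsers and recent Node.js versions); `codeUnitCounts` is a hypothetical helper name.

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: code units needed for a single character in each encoding form.
function codeUnitCounts(char) {
  return {
    utf8: encoder.encode(char).length, // 8-bit code units (bytes)
    utf16: char.length,                // JavaScript strings are sequences of UTF-16 code units
    utf32: [...char].length,           // string iteration yields one item per code point
  };
}

// 'A', 'ñ' (U+00F1), '€' (U+20AC), '💩' (U+1F4A9)
for (const char of ['A', '\u00F1', '\u20AC', '\u{1F4A9}']) {
  console.log(char, codeUnitCounts(char));
}
```

For the four samples, the UTF-8 counts are 1, 2, 3 and 4 bytes, while only the last one needs two UTF-16 code units.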
Let's Talk Numbers
The Unicode code space is divided into 17 segments called "planes", which are further divided into blocks. Each plane has space for 65,536 (2¹⁶) code points, for a grand total of 1,114,112 code points. Planes are conventionally numbered from 0 to 16. The last two planes (15 and 16) are "Private Use Area" planes that are allocated to be used however one wishes. These two planes span 131,072 code points, of which 131,068 are available for private use (the last two code points of every plane are noncharacters).

| Plane | Name | Range |
|---|---|---|
| 0 | Basic Multilingual Plane | U+0000 to U+FFFF |
| 1 | Supplementary Multilingual Plane | U+10000 to U+1FFFF |
| 2 | Supplementary Ideographic Plane | U+20000 to U+2FFFF |
| 3 | Tertiary Ideographic Plane | U+30000 to U+3FFFF |
| 4 | Plane 4 (unassigned) | U+40000 to U+4FFFF |
| 5 | Plane 5 (unassigned) | U+50000 to U+5FFFF |
| 6 | Plane 6 (unassigned) | U+60000 to U+6FFFF |
| 7 | Plane 7 (unassigned) | U+70000 to U+7FFFF |
| 8 | Plane 8 (unassigned) | U+80000 to U+8FFFF |
| 9 | Plane 9 (unassigned) | U+90000 to U+9FFFF |
| 10 | Plane 10 (unassigned) | U+A0000 to U+AFFFF |
| 11 | Plane 11 (unassigned) | U+B0000 to U+BFFFF |
| 12 | Plane 12 (unassigned) | U+C0000 to U+CFFFF |
| 13 | Plane 13 (unassigned) | U+D0000 to U+DFFFF |
| 14 | Supplementary Special-purpose Plane | U+E0000 to U+EFFFF |
| 15 | Supplementary Private Use Area - A | U+F0000 to U+FFFFF |
| 16 | Supplementary Private Use Area - B | U+100000 to U+10FFFF |

The first plane is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters. The other sixteen planes (U+10000 → U+10FFFF) are called supplementary planes or astral planes.
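Because every plane holds exactly 2¹⁶ code points, the plane of a code point is simply its value divided by 65,536, rounded down. A minimal sketch follows, using the Unicode Standard's zero-based plane index (plane 0 is the BMP); `planeOf` is a hypothetical helper name.

```javascript
const PLANE_SIZE = 0x10000; // 65,536 code points per plane
const PLANE_COUNT = 17;

// Hypothetical helper: zero-based plane index of a code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point');
  }
  return codePoint >> 16; // same as Math.floor(codePoint / PLANE_SIZE)
}

console.log(PLANE_SIZE * PLANE_COUNT); // 1114112
console.log(planeOf(0x0041));          // 0  (Basic Multilingual Plane)
console.log(planeOf(0x1F4A9));         // 1  (Supplementary Multilingual Plane)
console.log(planeOf(0x10FFFF));        // 16 (Supplementary Private Use Area - B)
```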
UTF-16 Surrogate Pairs
Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.
The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.
The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.
Surrogate pair: A representation for a single abstract character that consists of a
sequence of two 16-bit code units, where the first value of the pair is a high-surrogate
code unit and the second value is a low-surrogate code unit. Surrogate pairs are used only in UTF-16.
Calculating Surrogate Pairs
The Unicode character 💩 Pile of Poo (U+1F4A9) must be encoded in UTF-16 as a surrogate pair, i.e. two surrogates. To convert any supplementary code point to a surrogate pair, use the following algorithm (in JavaScript). Keep in mind that we're using hexadecimal notation.

    var High_Surrogate = function(Code_Point){
      return Math.floor((Code_Point - 0x10000) / 0x400) + 0xD800;
    };
    var Low_Surrogate = function(Code_Point){
      return (Code_Point - 0x10000) % 0x400 + 0xDC00;
    };
    // Reverses the conversion
    var Code_Point = function(High_Surrogate, Low_Surrogate){
      return (High_Surrogate - 0xD800) * 0x400 + Low_Surrogate - 0xDC00 + 0x10000;
    };

    > var codepoint = 0x1F4A9; // 0x1F4A9 == 128169
    > High_Surrogate(codepoint).toString(16)
    "d83d" // 0xD83D == 55357
    > Low_Surrogate(codepoint).toString(16)
    "dca9" // 0xDCA9 == 56489
    > String.fromCharCode(High_Surrogate(codepoint), Low_Surrogate(codepoint));
    "💩"
    > String.fromCodePoint(0x1F4A9)
    "💩"
    > '\ud83d\udca9'
    "💩"
Composing & Decomposing
Unicode includes a mechanism for modifying character shape that greatly extends the supported glyph repertoire. This covers the use of combining diacritical marks. They are inserted after the main character. Multiple combining diacritics may be stacked over the same character. Unicode also contains precomposed versions of most letter/diacritic combinations in normal use.
Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ".
Precomposed characters may be decomposed for consistency or analysis. For example, in alphabetizing (collating) a list of names, the character "ü" may be decomposed into a "u" followed by the non-spacing character "¨". Once the character has been decomposed, it may be easier for the collation to work with the character because it can be processed as a "u" with modifications. This allows easier alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines the decompositions for all precomposed characters. It also defines normalization forms to provide for unique representations of characters.
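In JavaScript, composition and decomposition are exposed through the built-in `String.prototype.normalize`. The sketch below shows the "ü" example from above; `stripMarks` is a hypothetical helper that illustrates the kind of accent-insensitive key the collation paragraph describes, not a full collation algorithm.

```javascript
const precomposed = '\u00FC'; // "ü" as a single code point
const decomposed = 'u\u0308'; // "u" followed by U+0308 COMBINING DIAERESIS

console.log(precomposed === decomposed);                  // false
console.log(precomposed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC') === precomposed); // true

// Hypothetical helper: decompose, then drop all combining marks (\p{M}).
function stripMarks(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(stripMarks('M\u00FCller')); // "Muller"
```

For real sorting, a locale-aware comparison such as `Intl.Collator` is usually a better fit than stripping marks by hand.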
Myths of Unicode
From Mark Davis's Unicode Myths slides.
- Unicode is simply a 16-bit code: Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.
- You can use any unassigned codepoint for internal use: No. Eventually that hole will be filled with a different character. Instead use private use or noncharacters.
- Every Unicode code point represents a character: No. There are lots of noncharacters (FFFE, FFFF, 1FFFE, …). There are also surrogate code points, private use and unassigned code points, and control/format "characters" (RLM, ZWNJ, …).
- Unicode will run out of space: If it were linear, we would run out in 2140 AD. But it isn't linear. See https://www.unicode.org/roadmaps/
- Case mappings are 1-1: No. They can also be:
  - One-to-many: (ß → SS)
  - Contextual: (…Σ ↔ …ς AND …ΣΤ… ↔ …στ…)
  - Locale-sensitive: (I ↔ ı AND İ ↔ i)
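The "not every code point is a character" myth is easy to probe in code. Below is a minimal sketch that classifies noncharacters and surrogates. The ranges used (U+FDD0 to U+FDEF, the last two code points of every plane, and U+D800 to U+DFFF) come from the Unicode Standard; `isNoncharacter` and `isSurrogate` are hypothetical helper names.

```javascript
// Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane
// (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
function isNoncharacter(codePoint) {
  const inContiguousRange = codePoint >= 0xFDD0 && codePoint <= 0xFDEF;
  const lastTwoOfPlane = (codePoint & 0xFFFE) === 0xFFFE;
  return inContiguousRange || lastTwoOfPlane;
}

// Surrogate code points are reserved for UTF-16 and never stand for a character alone.
function isSurrogate(codePoint) {
  return codePoint >= 0xD800 && codePoint <= 0xDFFF;
}

let noncharacterCount = 0;
for (let cp = 0; cp <= 0x10FFFF; cp++) {
  if (isNoncharacter(cp)) noncharacterCount++;
}

console.log(noncharacterCount);       // 66
console.log(isNoncharacter(0x1FFFE)); // true
console.log(isSurrogate(0xD83D));     // true
```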
Applied Unicode Encodings
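The examples below all encode the same character, 🖖 (U+1F596).

| Encoding Type | Raw Encoding |
|---|---|
| HTML Entity (Decimal) | `&#128406;` |
| HTML Entity (Hexadecimal) | `&#x1F596;` |
| URL Escape Code | `%F0%9F%96%96` |
| UTF-8 (hex) | `0xF0 0x9F 0x96 0x96` (f09f9696) |
| UTF-8 (binary) | `11110000:10011111:10010110:10010110` |
| UTF-16/UTF-16BE (hex) | `0xD83D 0xDD96` (d83ddd96) |
| UTF-16LE (hex) | `0x3DD8 0x96DD` (3dd896dd) |
| UTF-32/UTF-32BE (hex) | `0x0001F596` (0001f596) |
| UTF-32LE (hex) | `0x96F50100` (96f50100) |
| Octal Escape Sequence | `\360\237\226\226` |

Source Code

| Language | Escape Sequence |
|---|---|
| JavaScript | `\u{1F596}` (ES6) or `\uD83D\uDD96` |
| JSON | `\uD83D\uDD96` |
| C | `\U0001F596` |
| C++ | `\U0001F596` |
| Java | `\uD83D\uDD96` |
| Python | `\U0001F596` |
| Perl | `\x{1F596}` |
| Ruby | `\u{1F596}` |
| CSS | `\01F596` |

Note that a four-digit `\uXXXX` escape can only name a single 16-bit code unit, so languages limited to that form need the surrogate pair.

Awesome Characters List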
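Several of these representations can be derived programmatically. The following sketch computes a few of them for U+1F596 using only standard globals (`TextEncoder`, `encodeURIComponent`, `String.fromCodePoint`); the `hex` helper and the `forms` object are names made up for this illustration.

```javascript
const codePoint = 0x1F596;
const char = String.fromCodePoint(codePoint);

// Uppercase hex with optional zero padding.
const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, '0');

const utf8Bytes = Array.from(new TextEncoder().encode(char));       // UTF-8 bytes
const utf16Units = char.split('').map((unit) => unit.charCodeAt(0)); // UTF-16 code units

const forms = {
  htmlDecimal: `&#${codePoint};`,
  htmlHex: `&#x${hex(codePoint, 0)};`,
  urlEscape: encodeURIComponent(char),
  utf8: utf8Bytes.map((b) => hex(b, 2)).join(' '),
  utf16: utf16Units.map((u) => hex(u, 4)).join(' '),
  jsEscape: utf16Units.map((u) => `\\u${hex(u, 4)}`).join(''),
};

console.log(forms);
```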
Special Characters
The Unicode Consortium published a general punctuation chart where you can find more details.
| Char | Name | Description |
|---|---|---|
| `''` | U+FEFF BYTE ORDER MARK (BOM) | Has the important property of being unambiguous under byte reordering. It is also zero width and invisible. In non-complying software (like the PHP interpreter) this leads to all sorts of fun behaviour. |
| `''` | U+FFFE (byte-swapped BOM) | A noncharacter, not a legal character. Seeing it at the beginning of text signals that the byte order is reversed. |
| `''` | U+200B ZERO WIDTH SPACE | A character with no width and no appearance; it marks a position where a line may break. |
| `' '` | U+00A0 NO-BREAK SPACE | Forces adjacent characters to stick together. Well known as `&nbsp;` in HTML. |
| `''` | U+00AD SOFT HYPHEN | (In HTML: `&shy;`) Like ZERO WIDTH SPACE, but shows a hyphen if (and only if) a break occurs. |
| `''` | U+200D ZERO WIDTH JOINER | Forces adjacent characters to be joined together (e.g., Arabic characters or supported emoji). Can be used to compose sequentially combined emoji. |
| `''` | U+2060 WORD JOINER | The same as U+00A0, but completely invisible. Good for writing @font-face on Twitter. |
| `' '` | U+1680 OGHAM SPACE MARK | A space that looks like a dash. Great to bring programmers close to madness: `1 + 2 === 3`. |
| `';'` | U+037E GREEK QUESTION MARK | A look-alike to the semicolon. Also a fun way to annoy developers. |
| `''` | U+202D LEFT-TO-RIGHT OVERRIDE | Forces the text that follows to be displayed left-to-right. |
| `''` | U+202E RIGHT-TO-LEFT OVERRIDE | Forces the text that follows to be displayed right-to-left. |
| `'ꓸ'` | U+A4F8 LISU LETTER TONE MYA TI | A lookalike for the period character. |
| `'ꓹ'` | U+A4F9 LISU LETTER TONE NA PO | A lookalike for the comma character. |
| `'ꓼ'` | U+A4FC LISU LETTER TONE MYA NA | A lookalike for the semicolon character. |
| `'ꓽ'` | U+A4FD LISU LETTER TONE MYA JEU | A lookalike for the colon character. |
| `'︀'` | Variation Selectors (U+FE00 to U+FE0F & U+E0100 to U+E01EF) | A set of 256 zero-width characters that possess the ID_Continue property, meaning they can be used in variable names (but not as the first character). What makes these special is that mouse cursors pass over them, as they are combining characters, unlike most other zero-width characters. |
| `'ᅟ'` | U+115F HANGUL CHOSEONG FILLER | In general it produces a space. Rendered as zero width (invisible) if not explicitly supported in rendering. Designated ID_Start. |
| `'ᅠ'` | U+1160 HANGUL JUNGSEONG FILLER | Perhaps it produces a space? Rendered as zero width (invisible) if not explicitly supported in rendering. Designated ID_Start. |
| `'ㅤ'` | U+3164 HANGUL FILLER | In general it produces a space. Rendered as zero width (invisible) if not explicitly supported in rendering. Designated ID_Start. |
Wait a second... what did I just read?
Variable identifiers can effectively include whitespace!
The U+3164 HANGUL FILLER character displays as an advancing whitespace character. If it is not explicitly supported in rendering, it is rendered as completely invisible (and non-advancing, i.e. "zero width"). That means the ugly replacement character symbol (�) should never be displayed.
I'm not yet sure why U+3164 was specified to behave this way. Interestingly, U+3164 was added to Unicode in version 1.1 (1993), so the consortium must have had a lot of time to think it through. Anyway, here are a few examples.

    > var ᅟ = 'foo';
    undefined
    > ᅟ
    'foo'
    > var ㅤ= alert;
    undefined
    > var foo = 'bar'
    undefined
    > if ( foo ===ㅤ`baz` ){} // alert
    undefined
    > var varㅤfooㅤ\u{A60C}ㅤπ = 'bar';
    undefined
    > varㅤfooㅤꘌㅤπ
    'bar'
**NOTE:** I've tested U+3164 rendering on Ubuntu and OS X with the following: `node`, `php`, `ruby`, `python3.5`, `scala` ,`vim`, `cat`, `chrome`+`github gist`. Atom is the only system that fails by (incorrectly) displaying empty boxes. I have yet to test it out on Emacs and Sublime. From what I understand, the Unicode Consortium will not reassign or rename characters or codepoints, but may be convinced to change character properties like ID_Start/ID_Continue.
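If invisible identifiers worry you, a simple scan can flag them. The sketch below searches source text for a handful of the blank or zero-width characters listed in the table above. The character list is illustrative rather than exhaustive, and `SUSPICIOUS` and `findInvisible` are hypothetical names.

```javascript
// A few characters that render as blank or zero width (not an exhaustive list):
// soft hyphen, Hangul fillers, zero-width space/non-joiner/joiner, word joiner,
// BOM, and the BMP variation selectors.
const SUSPICIOUS = /[\u00AD\u115F\u1160\u200B-\u200D\u2060\u3164\uFEFF\uFE00-\uFE0F]/gu;

// Hypothetical helper: report the position and code point of each match.
function findInvisible(source) {
  return Array.from(source.matchAll(SUSPICIOUS), (match) => ({
    index: match.index,
    codePoint: 'U+' + match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
  }));
}

console.log(findInvisible('var \u3164 = alert;'));
// [ { index: 4, codePoint: 'U+3164' } ]
console.log(findInvisible('var foo = 1;').length); // 0
```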
Modifiers
The zero-width joiner (ZWJ) is a non-printing character used in the computerized typesetting of some complex scripts such as the Arabic script or any Indic script. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.
The zero-width non-joiner (ZWNJ) is a non-printing character used in the computerization of writing systems that make use of ligatures. When placed between two characters that would otherwise be connected into a ligature, a ZWNJ causes them to be printed in their final and initial forms, respectively. This is also an effect of a space character, but a ZWNJ is used when it is desirable to keep the words closer together or to connect a word with its morpheme.
    > 'a'
    "a"
    > 'a\u{0308}'
    "ä"
    > 'a\u{20DE}\u{0308}'
    "a⃞̈"
    > 'a\u{20DE}\u{0308}\u{20DD}'
    "a⃞̈⃝"

    // Modifying Invisible Characters
    > '\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}'
    ""
    > '\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}'.length
    10
Uppercase Transformation Collisions
| Char | Code Point | Output Char |
|---|---|---|
| ß | 0x00DF | SS |
| ı | 0x0131 | I |
| ſ | 0x017F | S |
| ﬀ | 0xFB00 | FF |
| ﬁ | 0xFB01 | FI |
| ﬂ | 0xFB02 | FL |
| ﬃ | 0xFB03 | FFI |
| ﬄ | 0xFB04 | FFL |
| ﬅ | 0xFB05 | ST |
| ﬆ | 0xFB06 | ST |
Lowercase Transformation Collisions
| Char | Code Point | Output Char |
|---|---|---|
| K (KELVIN SIGN) | 0x212A | k |
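These collisions matter whenever case conversion is used to compare identifiers such as usernames. The following sketch reproduces a few of the collisions with the built-in `toUpperCase` and `toLowerCase`; `collidesWhenUppercased` is a hypothetical helper name.

```javascript
// Hypothetical helper: two different strings that become equal once uppercased.
function collidesWhenUppercased(a, b) {
  return a !== b && a.toUpperCase() === b.toUpperCase();
}

console.log('\u00DF'.toUpperCase()); // "SS"  (ß)
console.log('\u0131'.toUpperCase()); // "I"   (dotless ı)
console.log('\u017F'.toUpperCase()); // "S"   (long s)
console.log('\uFB00'.toUpperCase()); // "FF"  (ff ligature)
console.log('\u212A'.toLowerCase()); // "k"   (Kelvin sign)

console.log(collidesWhenUppercased('ma\u017Fter', 'master')); // true
```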
Quirks and Troubleshooting
- String length is often determined by counting code units rather than characters. In UTF-16 environments such as JavaScript, this means that a surrogate pair counts as two. Combining marks add to the length as well: a + ̈ renders as the single character ä, yet the sequence has a length of two.
- Similarly, reversing strings is often a non-trivial task. Again, surrogate pairs and diacritics must be reversed together. ESReverser provides a pretty good solution.
- Upper and lower case mappings are not always one-to-one. They can also be:
  - One-to-many: (ß → SS)
  - Contextual: (…Σ ↔ …ς AND …ΣΤ… ↔ …στ…)
  - Locale-sensitive: (I ↔ ı AND İ ↔ i)
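The length and reversal quirks can be seen directly in a JavaScript console. This is a small sketch using only built-in string methods; it illustrates the problem and is not a complete Unicode-aware reverser (combining marks would still need separate handling).

```javascript
const poo = '\u{1F4A9}'; // 💩, outside the BMP
console.log(poo.length);      // 2 (UTF-16 code units)
console.log([...poo].length); // 1 (code points)

const umlaut = 'a\u0308'; // "a" + COMBINING DIAERESIS, renders as one character
console.log(umlaut.length);                  // 2
console.log(umlaut.normalize('NFC').length); // 1

// Naive reversal works on code units and tears the surrogate pair apart.
const text = 'x' + poo;
const naive = text.split('').reverse().join('');
// Reversing by code point keeps the pair intact.
const byCodePoint = [...text].reverse().join('');

console.log(naive === poo + 'x');       // false
console.log(byCodePoint === poo + 'x'); // true
```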
One-To-Many Case Mappings
Most of the characters below show their one-to-many case mapping when uppercased, while others show it when lowercased. The mapped values listed here are lowercase and appear to correspond to full case folding. This list should be split up.
| Code Point | Character | Name | Mapped Character | Mapped Code Points |
|---|---|---|---|---|
| U+00DF | ß | LATIN SMALL LETTER SHARP S | s, s | U+0073, U+0073 |
| U+0130 | İ | LATIN CAPITAL LETTER I WITH DOT ABOVE | i, ̇ | U+0069, U+0307 |
| U+0149 | ŉ | LATIN SMALL LETTER N PRECEDED BY APOSTROPHE | ʼ, n | U+02BC, U+006E |
| U+01F0 | ǰ | LATIN SMALL LETTER J WITH CARON | j, ̌ | U+006A, U+030C |
| U+0390 | ΐ | GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS | ι, ̈, ́ | U+03B9, U+0308, U+0301 |
| U+03B0 | ΰ | GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS | υ, ̈, ́ | U+03C5, U+0308, U+0301 |
| U+0587 | և | ARMENIAN SMALL LIGATURE ECH YIWN | ե, ւ | U+0565, U+0582 |
| U+1E96 | ẖ | LATIN SMALL LETTER H WITH LINE BELOW | h, ̱ | U+0068, U+0331 |
| U+1E97 | ẗ | LATIN SMALL LETTER T WITH DIAERESIS | t, ̈ | U+0074, U+0308 |
| U+1E98 | ẘ | LATIN SMALL LETTER W WITH RING ABOVE | w, ̊ | U+0077, U+030A |
| U+1E99 | ẙ | LATIN SMALL LETTER Y WITH RING ABOVE | y, ̊ | U+0079, U+030A |
| U+1E9A | ẚ | LATIN SMALL LETTER A WITH RIGHT HALF RING | a, ʾ | U+0061, U+02BE |
| U+1E9E | ẞ | LATIN CAPITAL LETTER SHARP S | s, s | U+0073, U+0073 |
| U+1F50 | ὐ | GREEK SMALL LETTER UPSILON WITH PSILI | υ, ̓ | U+03C5, U+0313 |
| U+1F52 | ὒ | GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA | υ, ̓, ̀ | U+03C5, U+0313, U+0300 |
| U+1F54 | ὔ | GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA | υ, ̓, ́ | U+03C5, U+0313, U+0301 |
| U+1F56 | ὖ | GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI | υ, ̓, ͂ | U+03C5, U+0313, U+0342 |
| U+1F80 | ᾀ | GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI | ἀ, ι | U+1F00, U+03B9 |
| U+1F81 | ᾁ | GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI | ἁ, ι | U+1F01, U+03B9 |
| U+1F82 | ᾂ | GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI | ἂ, ι | U+1F02, U+03B9 |
| U+1F83 | ᾃ | GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI | ἃ, ι | U+1F03, U+03B9 |
| U+1F84 | ᾄ | GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI | ἄ, ι | U+1F04, U+03B9 |
| U+1F85 | ᾅ | GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI | ἅ, ι | U+1F05, U+03B9 |
| U+1F86 | ᾆ | GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI | ἆ, ι | U+1F06, U+03B9 |
| U+1F87 | ᾇ | GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI | ἇ, ι | U+1F07, U+03B9 |
| U+1F88 | ᾈ | GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI | ἀ, ι | U+1F00, U+03B9 |
| U+1F89 | ᾉ | GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI | ἁ, ι | U+1F01, U+03B9 |
| U+1F8A | ᾊ | GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI | ἂ, ι | U+1F02, U+03B9 |
| U+1F8B | ᾋ | GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI | ἃ, ι | U+1F03, U+03B9 |
| U+1F8C | ᾌ | GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI | ἄ, ι | U+1F04, U+03B9 |
| U+1F8D | ᾍ | GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI | ἅ, ι | U+1F05, U+03B9 |
| U+1F8E | ᾎ | GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI | ἆ, ι | U+1F06, U+03B9 |
| U+1F8F | ᾏ | GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI | ἇ, ι | U+1F07, U+03B9 |
| U+1F90 | ᾐ | GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI | ἠ, ι | U+1F20, U+03B9 |
| U+1F91 | ᾑ | GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI | ἡ, ι | U+1F21, U+03B9 |
| U+1F92 | ᾒ | GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI | ἢ, ι | U+1F22, U+03B9 |
| U+1F93 | ᾓ | GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI | ἣ, ι | U+1F23, U+03B9 |
| U+1F94 | ᾔ | GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI | ἤ, ι | U+1F24, U+03B9 |
| U+1F95 | ᾕ | GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI | ἥ, ι | U+1F25, U+03B9 |
| U+1F96 | ᾖ | GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI | ἦ, ι | U+1F26, U+03B9 |
| U+1F97 | ᾗ | GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI | ἧ, ι | U+1F27, U+03B9 |
| U+1F98 | ᾘ | GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI | ἠ, ι | U+1F20, U+03B9 |
| U+1F99 | ᾙ | GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI | ἡ, ι | U+1F21, U+03B9 |
| U+1F9A | ᾚ | GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI | ἢ, ι | U+1F22, U+03B9 |
| U+1F9B | ᾛ | GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI | ἣ, ι | U+1F23, U+03B9 |
| U+1F9C | ᾜ | GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI | ἤ, ι | U+1F24, U+03B9 |
| U+1F9D | ᾝ | GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI | ἥ, ι | U+1F25, U+03B9 |
| U+1F9E | ᾞ | GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI | ἦ, ι | U+1F26, U+03B9 |
| U+1F9F | ᾟ | GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI | ἧ, ι | U+1F27, U+03B9 |
| U+1FA0 | ᾠ | GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI | ὠ, ι | U+1F60, U+03B9 |
| U+1FA1 | ᾡ | GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI | ὡ, ι | U+1F61, U+03B9 |
| U+1FA2 | ᾢ | GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI | ὢ, ι | U+1F62, U+03B9 |
| U+1FA3 | ᾣ | GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI | ὣ, ι | U+1F63, U+03B9 |
| U+1FA4 | ᾤ | GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI | ὤ, ι | U+1F64, U+03B9 |
| U+1FA5 | ᾥ | GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI | ὥ, ι | U+1F65, U+03B9 |
| U+1FA6 | ᾦ | GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI | ὦ, ι | U+1F66, U+03B9 |
| U+1FA7 | ᾧ | GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI | ὧ, ι | U+1F67, U+03B9 |
| U+1FA8 | ᾨ | GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI | ὠ, ι | U+1F60, U+03B9 |
| U+1FA9 | ᾩ | GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI | ὡ, ι | U+1F61, U+03B9 |
| U+1FAA | ᾪ | GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI | ὢ, ι | U+1F62, U+03B9 |
| U+1FAB | ᾫ | GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI | ὣ, ι | U+1F63, U+03B9 |
| U+1FAC | ᾬ | GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI | ὤ, ι | U+1F64, U+03B9 |
| U+1FAD | ᾭ | GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI | ὥ, ι | U+1F65, U+03B9 |
| U+1FAE | ᾮ | GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI | ὦ, ι | U+1F66, U+03B9 |
| U+1FAF | ᾯ | GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI | ὧ, ι | U+1F67, U+03B9 |
| U+1FB2 | ᾲ | GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI | ὰ, ι | U+1F70, U+03B9 |
| U+1FB3 | ᾳ | GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI | α, ι | U+03B1, U+03B9 |
| U+1FB4 | ᾴ | GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI | ά, ι | U+03AC, U+03B9 |
| U+1FB6 | ᾶ | GREEK SMALL LETTER ALPHA WITH PERISPOMENI | α, ͂ | U+03B1, U+0342 |
| U+1FB7 | ᾷ | GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI | α, ͂, ι | U+03B1, U+0342, U+03B9 |
| U+1FBC | ᾼ | GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI | α, ι | U+03B1, U+03B9 |
| U+1FC2 | ῂ | GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI | ὴ, ι | U+1F74, U+03B9 |
| U+1FC3 | ῃ | GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI | η, ι | U+03B7, U+03B9 |
| U+1FC4 | ῄ | GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI | ή, ι | U+03AE, U+03B9 |
| U+1FC6 | ῆ | GREEK SMALL LETTER ETA WITH PERISPOMENI | η, ͂ | U+03B7, U+0342 |
| U+1FC7 | ῇ | GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI | η, ͂, ι | U+03B7, U+0342, U+03B9 |
| U+1FCC | ῌ | GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI | η, ι | U+03B7, U+03B9 |
| U+1FD2 | ῒ | GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA | ι, ̈, ̀ | U+03B9, U+0308, U+0300 |
| U+1FD3 | ΐ | GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA | ι, ̈, ́ | U+03B9, U+0308, U+0301 |
| U+1FD6 | ῖ | GREEK SMALL LETTER IOTA WITH PERISPOMENI | ι, ͂ | U+03B9, U+0342 |
| U+1FD7 | ῗ | GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI | ι, ̈, ͂ | U+03B9, U+0308, U+0342 |
| U+1FE2 | ῢ | GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA | υ, ̈, ̀ | U+03C5, U+0308, U+0300 |
| U+1FE3 | ΰ | GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA | υ, ̈, ́ | U+03C5, U+0308, U+0301 |
| U+1FE4 | ῤ | GREEK SMALL LETTER RHO WITH PSILI | ρ, ̓ | U+03C1, U+0313 |
| U+1FE6 | ῦ | GREEK SMALL LETTER UPSILON WITH PERISPOMENI | υ, ͂ | U+03C5, U+0342 |
| U+1FE7 | ῧ | GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI | υ, ̈, ͂ | U+03C5, U+0308, U+0342 |
| U+1FF2 | ῲ | GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI | ὼ, ι | U+1F7C, U+03B9 |
| U+1FF3 | ῳ | GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI | ω, ι | U+03C9, U+03B9 |
| U+1FF4 | ῴ | GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI | ώ, ι | U+03CE, U+03B9 |
| U+1FF6 | ῶ | GREEK SMALL LETTER OMEGA WITH PERISPOMENI | ω, ͂ | U+03C9, U+0342 |
| U+1FF7 | ῷ | GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI | ω, ͂, ι | U+03C9, U+0342, U+03B9 |
| U+1FFC | ῼ | GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI | ω, ι | U+03C9, U+03B9 |
| U+FB00 | ﬀ | LATIN SMALL LIGATURE FF | f, f | U+0066, U+0066 |
| U+FB01 | ﬁ | LATIN SMALL LIGATURE FI | f, i | U+0066, U+0069 |
| U+FB02 | ﬂ | LATIN SMALL LIGATURE FL | f, l | U+0066, U+006C |
| U+FB03 | ﬃ | LATIN SMALL LIGATURE FFI | f, f, i | U+0066, U+0066, U+0069 |
| U+FB04 | ﬄ | LATIN SMALL LIGATURE FFL | f, f, l | U+0066, U+0066, U+006C |
| U+FB05 | ﬅ | LATIN SMALL LIGATURE LONG S T | s, t | U+0073, U+0074 |
| U+FB06 | ﬆ | LATIN SMALL LIGATURE ST | s, t | U+0073, U+0074 |
| U+FB13 | ﬓ | ARMENIAN SMALL LIGATURE MEN NOW | մ, ն | U+0574, U+0576 |
| U+FB14 | ﬔ | ARMENIAN SMALL LIGATURE MEN ECH | մ, ե | U+0574, U+0565 |
| U+FB15 | ﬕ | ARMENIAN SMALL LIGATURE MEN INI | մ, ի | U+0574, U+056B |
| U+FB16 | ﬖ | ARMENIAN SMALL LIGATURE VEW NOW | վ, ն | U+057E, U+0576 |
| U+FB17 | ﬗ | ARMENIAN SMALL LIGATURE MEN XEH | մ, խ | U+0574, U+056D |
Awesome Packages & Libraries
- PhantomScript - 👻 🔦 Invisible JavaScript code execution & social engineering
- ESReverser - A Unicode-aware string reverser written in JavaScript.
- mimic - [ab]using Unicode to create tragedy
- python-ftfy - Given Unicode text, make its representation consistent and possibly less broken.
- vim-troll-stopper - Stop Unicode trolls from messing with your code.
Emojis
- Unicode Consortium's Emoji Chart
- Emojipedia - Information about specific emoji, news blog.
- emojitracker - Realtime emoji use on Twitter.
- World Translation Foundation - A way to promote, explore, and translate the written word into the pictorial alphabet of Emoji.
- Can I Emoji? - Displays the current status of native Emoji support across iOS, Android and Windows.
- How to register an emoji URL
Diversity
The Unicode Consortium has made a huge effort to better reflect and incorporate human diversity, including cultural practices. Here is the Consortium's diversity report.
Emoji covering a wider range of gender combinations are now available, such as same-sex families, couples holding hands, and couples kissing. The real kicker is emoji combined sequences, in which individual emoji are glued together with U+200D ZERO WIDTH JOINER. Basically:

| Code Points | Recipe | Combined |
|---|---|---|
| U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469 | 👩 + ❤️ + 👩 | 👩‍❤️‍👩 |
| U+1F468 U+200D U+1F468 U+200D U+1F467 U+200D U+1F466 | 👨 + 👨 + 👧 + 👦 | 👨‍👨‍👧‍👦 |

Further, emojis now support skin color modifiers.
Five symbol modifier characters that provide for a range of skin tones for human emoji were released in Unicode Version 8.0 (mid-2015). These characters are based on the six tones of the Fitzpatrick scale, a recognized standard for dermatology (there are many examples of this scale online, such as FitzpatrickSkinType.pdf). The exact shades may vary between implementations.
Unicode Consortium's Diversity report
| Code | Name | Samples |
|---|---|---|
| U+1F3FB | EMOJI MODIFIER FITZPATRICK TYPE-1-2 | 🏻 |
| U+1F3FC | EMOJI MODIFIER FITZPATRICK TYPE-3 | 🏼 |
| U+1F3FD | EMOJI MODIFIER FITZPATRICK TYPE-4 | 🏽 |
| U+1F3FE | EMOJI MODIFIER FITZPATRICK TYPE-5 | 🏾 |
| U+1F3FF | EMOJI MODIFIER FITZPATRICK TYPE-6 | 🏿 |

Just follow the desired emoji with one of the skin color modifiers, e.g. `\u{1F466}\u{1F3FE}`:

👦 + 🏾 → 👦🏾
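Both kinds of sequence can be assembled from their code points. The sketch below uses the built-in `String.fromCodePoint`; the constant names are chosen here for readability. Whether the result renders as a single glyph depends on the font and platform.

```javascript
const ZWJ = 0x200D;  // ZERO WIDTH JOINER
const VS16 = 0xFE0F; // VARIATION SELECTOR-16 (requests emoji presentation)

// woman + ZWJ + heavy black heart + VS16 + ZWJ + woman
const couple = String.fromCodePoint(0x1F469, ZWJ, 0x2764, VS16, ZWJ, 0x1F469);
// man + ZWJ + man + ZWJ + girl + ZWJ + boy
const family = String.fromCodePoint(0x1F468, ZWJ, 0x1F468, ZWJ, 0x1F467, ZWJ, 0x1F466);
// boy followed by the Fitzpatrick type-5 modifier
const boy = String.fromCodePoint(0x1F466, 0x1F3FE);

console.log(couple, family, boy);
console.log([...family].length); // 7 code points
console.log(family.length);      // 11 UTF-16 code units
console.log(boy === '\u{1F466}\u{1F3FE}'); // true
```

One visible "character" here is seven code points and eleven code units, which is another reason string length rarely means what you expect.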
Creatively Naming Variables and Methods
Examples are written in JavaScript (ES6)
In general, characters with the ID_Start property may be used at the beginning of a variable name. Characters with the ID_Continue property may be used after the first character of a variable name.

    function rand(μ, σ){ /* ... */ };
    String.prototype.reverseⵑ = function(){ /* ... */ };
    Number.prototype.isTrueɁ = function(){ /* ... */ };
    var WhatDoesThisDoɁɁɁɁ = 42
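The two properties can be tested with Unicode property escapes in regular expressions (available in modern JavaScript engines with the `u` flag). The following is a rough sketch only: `looksLikeIdentifier` is a hypothetical helper, and it ignores the extra rules JavaScript layers on top, such as allowing `$` and `_` at the start, allowing ZWNJ and ZWJ after the first character, and excluding reserved words.

```javascript
const ID_START = /^\p{ID_Start}$/u;
const ID_CONTINUE = /^\p{ID_Continue}$/u;

// Hypothetical helper: first code point must be ID_Start, the rest ID_Continue.
function looksLikeIdentifier(name) {
  const chars = [...name]; // split by code point, not by UTF-16 code unit
  if (chars.length === 0) return false;
  return ID_START.test(chars[0]) && chars.slice(1).every((c) => ID_CONTINUE.test(c));
}

console.log(looksLikeIdentifier('\u03C0'));  // true  (π)
console.log(looksLikeIdentifier('\u03BB1')); // true  (λ1)
console.log(looksLikeIdentifier('1\u03BB')); // false (digits are ID_Continue only)
console.log(looksLikeIdentifier('a-b'));     // false (hyphen is neither)
```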
Here are some really creative variable names from Mathias Bynens

    // How convenient!
    var π = Math.PI;

    // Sometimes, you just have to use the Bad Parts of JavaScript:
    var ಠ_ಠ = eval;

    // Code, Y U NO WORK?!
    var ლ_ಠ益ಠ_ლ = 42;

    // How about a JavaScript library for functional programming?
    var λ = function() {};

    // Obfuscate boring variable names for great justice
    var \u006C\u006F\u006C\u0077\u0061\u0074 = 'heh';

    // …or just make up random ones
    var Ꙭൽↈⴱ = 'huh';

    // While perfectly valid, this doesn’t work in most browsers:
    var foo\u200Cbar = 42;

    // This is *not* a bitwise left shift (`<<`):
    var 〱〱 = 2;
    // This is, though:
    〱〱 << 〱〱; // 8

    // Give yourself a discount:
    var price_9̶9̶_89 = 'cheap';

    // Fun with Roman numerals
    var Ⅳ = 4;
    var Ⅴ = 5;
    Ⅳ + Ⅴ; // 9

    // Cthulhu was here
    var Hͫ̆̒̐ͣ̊̄ͯ͗͏̵̗̻̰̠̬͝ͅE̴̷̬͎̱̘͇͍̾ͦ͊͒͊̓̓̐_̫̠̱̩̭̤͈̑̎̋ͮͩ̒͑̾͋͘Ç̳͕̯̭̱̲̣̠̜͋̍O̴̦̗̯̹̼ͭ̐ͨ̊̈͘͠M̶̝̠̭̭̤̻͓͑̓̊ͣͤ̎͟͠E̢̞̮̹͍̞̳̣ͣͪ͐̈T̡̯̳̭̜̠͕͌̈́̽̿ͤ̿̅̑Ḧ̱̱̺̰̳̹̘̰́̏ͪ̂̽͂̀͠ = 'Zalgo';
And here are some Unicode CSS classes from David Walsh

    <!-- place this within the document head -->
    <meta charset="UTF-8" />

    <!-- error message -->
    <div class="ಠ_ಠ">You do not have access to this page.</div>

    <!-- success message -->
    <div class="❤">Your changes have been saved successfully!</div>

The matching stylesheet:

    .ಠ_ಠ {
      border: 1px solid #f00;
    }
    .❤ {
      background: lightgreen;
    }
Recursive HTML Tag Renaming Script
If you want to rename all your HTML tags to what appears as nothing, the following script is just what you're looking for.
Do note however that HTML does not support all unicode characters.
    // U+1160 HANGUL JUNGSEONG FILLER
    transformAllTags('ᅠ');

    // An actual HTML element node designed to look like a comment node, using the U+01C3 LATIN LETTER RETROFLEX CLICK
    // <ǃ-- name="viewport" content="width=device-width"></ǃ-->
    transformAllTags('ǃ--');

    // or even <ᅠ⃝
    transformAllTags('\u{1160}\u{20dd}');

    // and for a bonus, all existing tag names will have each character ensquared. h⃞t⃞m⃞l⃞
    transformAllTags();

    function transformAllTags (newName){
      // querySelectorAll doesn't actually return an array.
      Array.from(document.querySelectorAll('*'))
        .forEach(function(x){
          transformTag(x, newName);
        });
    }

    function wonky(str){
      return str.split('').join('\u{20de}') + '\u{20de}';
    }

    function transformTag(tagIdOrElem, tagType){
      var elem = (tagIdOrElem instanceof HTMLElement) ? tagIdOrElem : document.getElementById(tagIdOrElem);
      if(!elem || !(elem instanceof HTMLElement))return;
      var children = elem.childNodes;
      var parent = elem.parentNode;
      var newNode = document.createElement(tagType||wonky(elem.tagName));
      for(var a=0;a<elem.attributes.length;a++){
        newNode.setAttribute(elem.attributes[a].nodeName, elem.attributes[a].value);
      }
      for(var i= 0,clen=children.length;i<clen;i++){
        newNode.appendChild(children[0]); //0...always point to the first non-moved element
      }
      newNode.style.cssText = elem.style.cssText;
      parent.replaceChild(newNode,elem);
    }
Here is what it does support:
    function testBegin(str){
      try{
        eval(`document.createElement( '${str}' );`)
        return true;
      }
      catch(e){ return false; }
    }

    function testContinue(str){
      try{
        eval(`document.createElement( 'a${str}' );`)
        return true;
      }
      catch(e){ return false; }
    }

And here are some basic results

    // Test if dashes can start an HTML Tag
    > testBegin('-')
    < false
    > testContinue('-')
    < true
    > testBegin('ᅠ-') // Prepend dash with U+1160 HANGUL JUNGSEONG FILLER
    < true
Unicode Fonts
A single TrueType / OpenType font cannot cover all Unicode characters, as the format has a hard limit of 65,535 glyphs per font. Since the Unicode codespace holds over 1.1 million code points (1,114,112), you will need a font-family made up of several fonts to cover them all.
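The numbers above can be checked with a little arithmetic. The sketch below assumes the 17 planes of 65,536 code points each that make up the Unicode codespace, and takes the 65,535 glyph limit from the paragraph above. Treating one code point as one glyph is a simplification: many code points are unassigned, and a font can map several characters to a shared glyph, so the result is only a rough bound.

```javascript
// Back-of-the-envelope check; the constant names are illustrative.
const PLANES = 17;
const CODE_POINTS_PER_PLANE = 0x10000; // 65,536
const MAX_GLYPHS_PER_FONT = 65535;     // limit quoted in the text

const codespaceSize = PLANES * CODE_POINTS_PER_PLANE;
const lastCodePoint = codespaceSize - 1;

// Assumption: one glyph per code point, purely for the sake of the estimate.
const minFonts = Math.ceil(codespaceSize / MAX_GLYPHS_PER_FONT);

console.log(codespaceSize);                               // 1114112
console.log('U+' + lastCodePoint.toString(16).toUpperCase()); // U+10FFFF
console.log(minFonts);                                    // 18
```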
More Reading
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets - By Joel Spolsky
- What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
- The Unicode Consortium's Recommended Reading List
- Space Yourself - Smashing Magazine's Spacing Guide
- JavaScript has a Unicode Problem
- Creative usernames and Spotify account hijacking
Exploring Deeper into Unicode Yourself
- Shapecatcher - Draw the character you're looking for.
- Confusable Unicode Characters
- Unicode Character Database
- Database Dumps of Codepoints.net
- Unicode Blocks List
- Unicode Character Code Charts
- Unicode Case Charts
- Unicode Normalization Chart
- Unicode FAQ
Overview Map
A map of the Basic Multilingual Plane
Each numbered box represents 256 code points.
The Chinese, Japanese and Korean (CJK) scripts share a common background, collectively known as CJK characters. In the process called Han unification, the common (shared) characters were identified and named "CJK Unified Ideographs".
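Since each box in the map covers 256 code points, a code point's position can be found with two bit shifts. This is a small sketch, assuming the boxes are numbered by the high byte of the code point (00 to FF within a plane); the locate function and its field names are made up for illustration.

```javascript
// Hypothetical helper: split a code point into plane, 256-wide box, and offset.
function locate(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + codePoint);
  }
  return {
    plane: codePoint >> 16,        // 0 is the Basic Multilingual Plane
    box: (codePoint >> 8) & 0xFF,  // which 256-code-point box inside the plane
    offset: codePoint & 0xFF,      // position inside that box
  };
}

// U+4E00, the first CJK Unified Ideograph, opens box 0x4E of the BMP.
console.log(locate(0x4E00));
// U+1160 HANGUL JUNGSEONG FILLER sits in box 0x11, inside Hangul Jamo.
console.log(locate('\u1160'.codePointAt(0)));
// U+1F600 lies outside the BMP, in plane 1.
console.log(locate(0x1F600));
```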
Unicode Blocks
The Unicode standard arranges groups of characters together in blocks. This is the complete list of blocks across all 17 planes.
| Name | From | To | # Codepoints |
| --- | --- | --- | --- |
| Basic Latin | U+0000 | U+007F | 128 |
| Latin-1 Supplement | U+0080 | U+00FF | 128 |
| Latin Extended-A | U+0100 | U+017F | 128 |
| Latin Extended-B | U+0180 | U+024F | 208 |
| IPA Extensions | U+0250 | U+02AF | 96 |
| Spacing Modifier Letters | U+02B0 | U+02FF | 80 |
| Combining Diacritical Marks | U+0300 | U+036F | 112 |
| Greek and Coptic | U+0370 | U+03FF | 135 |
| Cyrillic | U+0400 | U+04FF | 256 |
| Cyrillic Supplement | U+0500 | U+052F | 48 |
| Armenian | U+0530 | U+058F | 89 |
| Hebrew | U+0590 | U+05FF | 87 |
| Arabic | U+0600 | U+06FF | 255 |
| Syriac | U+0700 | U+074F | 77 |
| Arabic Supplement | U+0750 | U+077F | 48 |
| Thaana | U+0780 | U+07BF | 50 |
| NKo | U+07C0 | U+07FF | 59 |
| Samaritan | U+0800 | U+083F | 61 |
| Mandaic | U+0840 | U+085F | 29 |
| Arabic Extended-A | U+08A0 | U+08FF | 50 |
| Devanagari | U+0900 | U+097F | 128 |
| Bengali | U+0980 | U+09FF | 93 |
| Gurmukhi | U+0A00 | U+0A7F | 79 |
| Gujarati | U+0A80 | U+0AFF | 85 |
| Oriya | U+0B00 | U+0B7F | 90 |
| Tamil | U+0B80 | U+0BFF | 72 |
| Telugu | U+0C00 | U+0C7F | 96 |
| Kannada | U+0C80 | U+0CFF | 87 |
| Malayalam | U+0D00 | U+0D7F | 100 |
| Sinhala | U+0D80 | U+0DFF | 90 |
| Thai | U+0E00 | U+0E7F | 87 |
| Lao | U+0E80 | U+0EFF | 67 |
| Tibetan | U+0F00 | U+0FFF | 211 |
| Myanmar | U+1000 | U+109F | 160 |
| Georgian | U+10A0 | U+10FF | 88 |
| Hangul Jamo | U+1100 | U+11FF | 256 |
| Ethiopic | U+1200 | U+137F | 358 |
| Ethiopic Supplement | U+1380 | U+139F | 26 |
| Cherokee | U+13A0 | U+13FF | 92 |
| Unified Canadian Aboriginal Syllabics | U+1400 | U+167F | 640 |
| Ogham | U+1680 | U+169F | 29 |
| Runic | U+16A0 | U+16FF | 89 |
| Tagalog | U+1700 | U+171F | 20 |
| Hanunoo | U+1720 | U+173F | 23 |
| Buhid | U+1740 | U+175F | 20 |
| Tagbanwa | U+1760 | U+177F | 18 |
| Khmer | U+1780 | U+17FF | 114 |
| Mongolian | U+1800 | U+18AF | 156 |
| Unified Canadian Aboriginal Syllabics Extended | U+18B0 | U+18FF | 70 |
| Limbu | U+1900 | U+194F | 68 |
| Tai Le | U+1950 | U+197F | 35 |
| New Tai Lue | U+1980 | U+19DF | 83 |
| Khmer Symbols | U+19E0 | U+19FF | 32 |
| Buginese | U+1A00 | U+1A1F | 30 |
| Tai Tham | U+1A20 | U+1AAF | 127 |
| Combining Diacritical Marks Extended | U+1AB0 | U+1AFF | 15 |
| Balinese | U+1B00 | U+1B7F | 121 |
| Sundanese | U+1B80 | U+1BBF | 64 |
| Batak | U+1BC0 | U+1BFF | 56 |
| Lepcha | U+1C00 | U+1C4F | 74 |
| Ol Chiki | U+1C50 | U+1C7F | 48 |
| Sundanese Supplement | U+1CC0 | U+1CCF | 8 |
| Vedic Extensions | U+1CD0 | U+1CFF | 41 |
| Phonetic Extensions | U+1D00 | U+1D7F | 128 |
| Phonetic Extensions Supplement | U+1D80 | U+1DBF | 64 |
| Combining Diacritical Marks Supplement | U+1DC0 | U+1DFF | 58 |
| Latin Extended Additional | U+1E00 | U+1EFF | 256 |
| Greek Extended | U+1F00 | U+1FFF | 233 |
| General Punctuation | U+2000 | U+206F | 111 |
| Superscripts and Subscripts | U+2070 | U+209F | 42 |
| Currency Symbols | U+20A0 | U+20CF | 31 |
| Combining Diacritical Marks for Symbols | U+20D0 | U+20FF | 33 |
| Letterlike Symbols | U+2100 | U+214F | 80 |
| Number Forms | U+2150 | U+218F | 60 |
| Arrows | U+2190 | U+21FF | 112 |
| Mathematical Operators | U+2200 | U+22FF | 256 |
| Miscellaneous Technical | U+2300 | U+23FF | 251 |
| Control Pictures | U+2400 | U+243F | 39 |
| Optical Character Recognition | U+2440 | U+245F | 11 |
| Enclosed Alphanumerics | U+2460 | U+24FF | 160 |
| Box Drawing | U+2500 | U+257F | 128 |
| Block Elements | U+2580 | U+259F | 32 |
| Geometric Shapes | U+25A0 | U+25FF | 96 |
| Miscellaneous Symbols | U+2600 | U+26FF | 256 |
| Dingbats | U+2700 | U+27BF | 192 |
| Miscellaneous Mathematical Symbols-A | U+27C0 | U+27EF | 48 |
| Supplemental Arrows-A | U+27F0 | U+27FF | 16 |
| Braille Patterns | U+2800 | U+28FF | 256 |
| Supplemental Arrows-B | U+2900 | U+297F | 128 |
| Miscellaneous Mathematical Symbols-B | U+2980 | U+29FF | 128 |
| Supplemental Mathematical Operators | U+2A00 | U+2AFF | 256 |
| Miscellaneous Symbols and Arrows | U+2B00 | U+2BFF | 206 |
| Glagolitic | U+2C00 | U+2C5F | 94 |
| Latin Extended-C | U+2C60 | U+2C7F | 32 |
| Coptic | U+2C80 | U+2CFF | 123 |
| Georgian Supplement | U+2D00 | U+2D2F | 40 |
| Tifinagh | U+2D30 | U+2D7F | 59 |
| Ethiopic Extended | U+2D80 | U+2DDF | 79 |
| Cyrillic Extended-A | U+2DE0 | U+2DFF | 32 |
| Supplemental Punctuation | U+2E00 | U+2E7F | 67 |
| CJK Radicals Supplement | U+2E80 | U+2EFF | 115 |
| Kangxi Radicals | U+2F00 | U+2FDF | 214 |
| Ideographic Description Characters | U+2FF0 | U+2FFF | 12 |
| CJK Symbols and Punctuation | U+3000 | U+303F | 64 |
| Hiragana | U+3040 | U+309F | 93 |
| Katakana | U+30A0 | U+30FF | 96 |
| Bopomofo | U+3100 | U+312F | 41 |
| Hangul Compatibility Jamo | U+3130 | U+318F | 94 |
| Kanbun | U+3190 | U+319F | 16 |
| Bopomofo Extended | U+31A0 | U+31BF | 27 |
| CJK Strokes | U+31C0 | U+31EF | 36 |
| Katakana Phonetic Extensions | U+31F0 | U+31FF | 16 |
| Enclosed CJK Letters and Months | U+3200 | U+32FF | 254 |
| CJK Compatibility | U+3300 | U+33FF | 256 |
| CJK Unified Ideographs Extension A | U+3400 | U+4DBF | 6191 |
| Yijing Hexagram Symbols | U+4DC0 | U+4DFF | 64 |
| CJK Unified Ideographs | U+4E00 | U+9FFF | 20941 |
| Yi Syllables | U+A000 | U+A48F | 1165 |
| Yi Radicals | U+A490 | U+A4CF | 55 |
| Lisu | U+A4D0 | U+A4FF | 48 |
| Vai | U+A500 | U+A63F | 300 |
| Cyrillic Extended-B | U+A640 | U+A69F | 96 |
| Bamum | U+A6A0 | U+A6FF | 88 |
| Modifier Tone Letters | U+A700 | U+A71F | 32 |
| Latin Extended-D | U+A720 | U+A7FF | 159 |
| Syloti Nagri | U+A800 | U+A82F | 44 |
| Common Indic Number Forms | U+A830 | U+A83F | 10 |
| Phags-pa | U+A840 | U+A87F | 56 |
| Saurashtra | U+A880 | U+A8DF | 81 |
| Devanagari Extended | U+A8E0 | U+A8FF | 30 |
| Kayah Li | U+A900 | U+A92F | 48 |
| Rejang | U+A930 | U+A95F | 37 |
| Hangul Jamo Extended-A | U+A960 | U+A97F | 29 |
| Javanese | U+A980 | U+A9DF | 91 |
| Myanmar Extended-B | U+A9E0 | U+A9FF | 31 |
| Cham | U+AA00 | U+AA5F | 83 |
| Myanmar Extended-A | U+AA60 | U+AA7F | 32 |
| Tai Viet | U+AA80 | U+AADF | 72 |
| Meetei Mayek Extensions | U+AAE0 | U+AAFF | 23 |
| Ethiopic Extended-A | U+AB00 | U+AB2F | 32 |
| Latin Extended-E | U+AB30 | U+AB6F | 54 |
| Cherokee Supplement | U+AB70 | U+ABBF | 80 |
| Meetei Mayek | U+ABC0 | U+ABFF | 56 |
| Hangul Syllables | U+AC00 | U+D7AF | 2 |
| Hangul Jamo Extended-B | U+D7B0 | U+D7FF | 72 |
| High Surrogates | U+D800 | U+DB7F | 2 |
| High Private Use Surrogates | U+DB80 | U+DBFF | 2 |
| Low Surrogates | U+DC00 | U+DFFF | 2 |
| Private Use Area | U+E000 | U+F8FF | 2 |
| CJK Compatibility Ideographs | U+F900 | U+FAFF | 472 |
| Alphabetic Presentation Forms | U+FB00 | U+FB4F | 58 |
| Arabic Presentation Forms-A | U+FB50 | U+FDFF | 643 |
| Variation Selectors | U+FE00 | U+FE0F | 16 |
| Vertical Forms | U+FE10 | U+FE1F | 10 |
| Combining Half Marks | U+FE20 | U+FE2F | 16 |
| CJK Compatibility Forms | U+FE30 | U+FE4F | 32 |
| Small Form Variants | U+FE50 | U+FE6F | 26 |
| Arabic Presentation Forms-B | U+FE70 | U+FEFF | 141 |
| Halfwidth and Fullwidth Forms | U+FF00 | U+FFEF | 225 |
| Specials | U+FFF0 | U+FFFF | 7 |
| Linear B Syllabary | U+10000 | U+1007F | 88 |
| Linear B Ideograms | U+10080 | U+100FF | 123 |
| Aegean Numbers | U+10100 | U+1013F | 57 |
| Ancient Greek Numbers | U+10140 | U+1018F | 77 |
| Ancient Symbols | U+10190 | U+101CF | 13 |
| Phaistos Disc | U+101D0 | U+101FF | 46 |
| Lycian | U+10280 | U+1029F | 29 |
| Carian | U+102A0 | U+102DF | 49 |
| Coptic Epact Numbers | U+102E0 | U+102FF | 28 |
| Old Italic | U+10300 | U+1032F | 36 |
| Gothic | U+10330 | U+1034F | 27 |
| Old Permic | U+10350 | U+1037F | 43 |
| Ugaritic | U+10380 | U+1039F | 31 |
| Old Persian | U+103A0 | U+103DF | 50 |
| Deseret | U+10400 | U+1044F | 80 |
| Shavian | U+10450 | U+1047F | 48 |
| Osmanya | U+10480 | U+104AF | 40 |
| Elbasan | U+10500 | U+1052F | 40 |
| Caucasian Albanian | U+10530 | U+1056F | 53 |
| Linear A | U+10600 | U+1077F | 341 |
| Cypriot Syllabary | U+10800 | U+1083F | 55 |
| Imperial Aramaic | U+10840 | U+1085F | 31 |
| Palmyrene | U+10860 | U+1087F | 32 |
| Nabataean | U+10880 | U+108AF | 40 |
| Hatran | U+108E0 | U+108FF | 26 |
| Phoenician | U+10900 | U+1091F | 29 |
| Lydian | U+10920 | U+1093F | 27 |
| Meroitic Hieroglyphs | U+10980 | U+1099F | 32 |
| Meroitic Cursive | U+109A0 | U+109FF | 90 |
| Kharoshthi | U+10A00 | U+10A5F | 65 |
| Old South Arabian | U+10A60 | U+10A7F | 32 |
| Old North Arabian | U+10A80 | U+10A9F | 32 |
| Manichaean | U+10AC0 | U+10AFF | 51 |
| Avestan | U+10B00 | U+10B3F | 61 |
| Inscriptional Parthian | U+10B40 | U+10B5F | 30 |
| Inscriptional Pahlavi | U+10B60 | U+10B7F | 27 |
| Psalter Pahlavi | U+10B80 | U+10BAF | 29 |
| Old Turkic | U+10C00 | U+10C4F | 73 |
| Old Hungarian | U+10C80 | U+10CFF | 108 |
| Rumi Numeral Symbols | U+10E60 | U+10E7F | 31 |
| Brahmi | U+11000 | U+1107F | 109 |
| Kaithi | U+11080 | U+110CF | 66 |
| Sora Sompeng | U+110D0 | U+110FF | 35 |
| Chakma | U+11100 | U+1114F | 67 |
| Mahajani | U+11150 | U+1117F | 39 |
| Sharada | U+11180 | U+111DF | 94 |
| Sinhala Archaic Numbers | U+111E0 | U+111FF | 20 |
| Khojki | U+11200 | U+1124F | 61 |
| Multani | U+11280 | U+112AF | 38 |
| Khudawadi | U+112B0 | U+112FF | 69 |
| Grantha | U+11300 | U+1137F | 85 |
| Tirhuta | U+11480 | U+114DF | 82 |
| Siddham | U+11580 | U+115FF | 92 |
| Modi | U+11600 | U+1165F | 79 |
| Takri | U+11680 | U+116CF | 66 |
| Ahom | U+11700 | U+1173F | 57 |
| Warang Citi | U+118A0 | U+118FF | 84 |
| Pau Cin Hau | U+11AC0 | U+11AFF | 57 |
| Cuneiform | U+12000 | U+123FF | 922 |
| Cuneiform Numbers and Punctuation | U+12400 | U+1247F | 116 |
| Early Dynastic Cuneiform | U+12480 | U+1254F | 196 |
| Egyptian Hieroglyphs | U+13000 | U+1342F | 1071 |
| Anatolian Hieroglyphs | U+14400 | U+1467F | 583 |
| Bamum Supplement | U+16800 | U+16A3F | 569 |
| Mro | U+16A40 | U+16A6F | 43 |
| Bassa Vah | U+16AD0 | U+16AFF | 36 |
| Pahawh Hmong | U+16B00 | U+16B8F | 127 |
| Miao | U+16F00 | U+16F9F | 133 |
| Kana Supplement | U+1B000 | U+1B0FF | 2 |
| Duployan | U+1BC00 | U+1BC9F | 143 |
| Shorthand Format Controls | U+1BCA0 | U+1BCAF | 4 |
| Byzantine Musical Symbols | U+1D000 | U+1D0FF | 246 |
| Musical Symbols | U+1D100 | U+1D1FF | 231 |
| Ancient Greek Musical Notation | U+1D200 | U+1D24F | 70 |
| Tai Xuan Jing Symbols | U+1D300 | U+1D35F | 87 |
| Counting Rod Numerals | U+1D360 | U+1D37F | 18 |
| Mathematical Alphanumeric Symbols | U+1D400 | U+1D7FF | 996 |
| Sutton SignWriting | U+1D800 | U+1DAAF | 672 |
| Mende Kikakui | U+1E800 | U+1E8DF | 213 |
| Arabic Mathematical Alphabetic Symbols | U+1EE00 | U+1EEFF | 143 |
| Mahjong Tiles | U+1F000 | U+1F02F | 44 |
| Domino Tiles | U+1F030 | U+1F09F | 100 |
| Playing Cards | U+1F0A0 | U+1F0FF | 82 |
| Enclosed Alphanumeric Supplement | U+1F100 | U+1F1FF | 173 |
| Enclosed Ideographic Supplement | U+1F200 | U+1F2FF | 57 |
| Miscellaneous Symbols and Pictographs | U+1F300 | U+1F5FF | 766 |
| Emoticons | U+1F600 | U+1F64F | 80 |
| Ornamental Dingbats | U+1F650 | U+1F67F | 48 |
| Transport and Map Symbols | U+1F680 | U+1F6FF | 98 |
| Alchemical Symbols | U+1F700 | U+1F77F | 116 |
| Geometric Shapes Extended | U+1F780 | U+1F7FF | 85 |
| Supplemental Arrows-C | U+1F800 | U+1F8FF | 148 |
| Supplemental Symbols and Pictographs | U+1F900 | U+1F9FF | 15 |
| CJK Unified Ideographs Extension B | U+20000 | U+2A6DF | 42676 |
| CJK Unified Ideographs Extension C | U+2A700 | U+2B73F | 60 |
| CJK Unified Ideographs Extension D | U+2B740 | U+2B81F | 27 |
| CJK Unified Ideographs Extension E | U+2B820 | U+2CEAF | 2 |
| CJK Compatibility Ideographs Supplement | U+2F800 | U+2FA1F | 542 |
| Tags | U+E0000 | U+E007F | 97 |
| Variation Selectors Supplement | U+E0100 | U+E01EF | 240 |
| Supplementary Private Use Area-A | U+F0000 | U+FFFFF | 4 |
| Supplementary Private Use Area-B | U+100000 | U+10FFFF | 4 |

Principles of the Unicode Standard
The Unicode Standard sets forth the following fundamental principles:
- Universal repertoire - Every writing system ever used shall be respected and represented in the standard
- Logical order - In bidirectional text, characters are stored in logical order, not in the order suggested by their visual representation.
- Efficiency - The documentation must be efficient and complete.
- Unification - Where different cultures or languages use the same character, it shall be only included once. This point is
- Characters, not glyphs - Only characters, not glyphs, shall be encoded. In a nutshell, glyphs are the actual graphical representations of characters.
- Dynamic composition - New characters can be composed of other, already standardized characters. For example, the character “Ä” can be composed of an “A” and a dieresis sign (“ ¨ ”).
- Semantics - Included characters must be well defined and distinguished from others.
- Stability - Once defined, characters shall never be removed or have their code points reassigned. In the case of an error, a code point shall be deprecated.
- Plain Text - Characters in the standard are text and never mark-up or metacharacters.
- Convertibility - Every other used encoding shall be representable in terms of a Unicode encoding.
Source: Principle descriptions are from codepoints.net
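The dynamic composition principle can be observed directly in JavaScript. This is a minimal sketch using the built-in String.prototype.normalize; it assumes the combining form of the dieresis, U+0308 COMBINING DIAERESIS, rather than the spacing character shown in the list above, and the compose helper is a made-up name.

```javascript
// Hypothetical helper: join a base letter and a combining mark, then ask for
// the canonical composed form (NFC).
function compose(base, combiningMark) {
  return (base + combiningMark).normalize('NFC');
}

const decomposed = 'A\u0308';            // "A" followed by COMBINING DIAERESIS
const composed = compose('A', '\u0308'); // single precomposed character

console.log(decomposed.length);                     // 2
console.log(composed.length);                       // 1
console.log(composed === '\u00C4');                 // true
console.log(composed.normalize('NFD') === decomposed); // true
```

Both strings render as the same "Ä", which is why text should be normalized to one form before comparing it.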
Unicode Versions
- Version 11.0 (Draft)
- Version 10.0 (Latest Version, Published June 20th, 2017)
- Version 9.0
- Version 8.0
- Version 7.0
- Version 6.3
- Version 6.2
- Version 6.1
- Version 6.0
- Version 5.2
- Version 5.1
- Version 5.0 (unavailable)
- Version 4.0.1
- Version 4.0