14

Unicode Is Awesome

 4 years ago
source link: https://wisdom.engineering/awesome-unicode/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

A curated list of delightful Unicode tidbits, packages and resources.

Foreword

Unicode is Awesome! Prior to Unicode, international communication was grueling- everyone had defined their separate extended character set in the upperhalf of ASCII (called Code Pages) that would conflict- Just think, German speakers coordinating with Korean speakers over which 127 character Code Page to use. Thankfully the Unicode standard caught on and unified communication. Unicode 8.0 standardizes over 120,000 characters from over 129 scripts - some modern, some ancient, and some still undeciphered. Unicode handles left-to-right and right-to-left text, combining marks, and includes diverse cultural, political, religious characters and emojis. Unicode is awesomely human - and ultimately underappreciated.

Contents

  • Quick Unicode Background
    • What Characters Does the Unicode Standard Include?
    • Unicode Character Encodings
    • UTF-16 Surrogate Pairs
    • Calculating Surrogate Pairs
    • Composing & Decomposing
    • Applied Unicode Encodings
  • Awesome Characters List
    • Variable identifiers can effectively include whitespace!
    • Uppercase Transformation Collisions
    • Lowercase Transformation Collisions
  • Quirks and Troubleshooting
    • One-To-Many Case Mappings
  • Awesome Packages & Libraries
  • Creatively Naming Variables and Methods
    • Recursive HTML Tag Renaming Script
  • Exploring Deeper into Unicode Yourself
    • A map of the Basic Multilingual Plane
  • Principles of the Unicode Standard

Quick Unicode Background

What Characters Does the Unicode Standard Include?

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. It provides codes for diacritics, which are modifying character marks such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 9.0 provides codes for 128,172 characters from the world's alphabets, ideograph sets, and symbol collections.

The majority of common-use characters fit into the first 64K code points, an area of the codespace that is called the basic multilingual plane, or BMP for short. There are sixteen other supplementary planes available for encoding other characters, with currently over 850,000 unused code points. More characters are under consideration for addition to future versions of the standard.

The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.

Unicode Character Encodings

Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits.

The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32-bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is useful where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.

Lets talk Numbers

The Unicode characterset is divided into 17 core segments called "planes", which are further divided into blocks. Each plane has space for 65,536 (2¹⁶) codepoints, supporting a grand total of 1,114,112 codepoints. There are two "Private Use Area" planes (#16 & #17) that are allocated to be used however one wishes. These two Private Use planes account for 131,072 codepoints.

# Name Range 1. Basic Multilingual Plane (U+0000 to U+FFFF) 2. Supplementary Multilingual Plane (U+10000 to U+1FFFF) 3. Supplementary Ideographic Plane (U+20000 to U+2FFFF) 4. Tertiary Ideographic Plane (U+30000 to U+3FFFF) 5. Plane 5 (unassigned) (U+40000 to U+4FFFF) 6. Plane 6 (unassigned) (U+50000 to U+5FFFF) 7. Plane 7 (unassigned) (U+60000 to U+6FFFF) 8. Plane 8 (unassigned) (U+70000 to U+7FFFF) 9. Plane 9 (unassigned) (U+80000 to U+8FFFF) 10. Plane 10 (unassigned) (U+90000 to U+9FFFF) 11. Plane 11 (unassigned) (U+A0000 to U+AFFFF) 12. Plane 12 (unassigned) (U+B0000 to U+BFFFF) 13. Plane 13 (unassigned) (U+C0000 to U+CFFFF) 14. Plane 14 (unassigned) (U+D0000 to U+DFFFF) 15. Supplementary Special-purpose Plane (U+E0000 to U+EFFFF) 16. Supplementary Private Use Area - A (U+F0000 to U+FFFFF) 17. Supplementary Private Use Area - B (U+100000 to U+10FFFF)

The first plane is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters. The other sixteen planes (U+010000 → U+10FFFF) are called supplementary planes or astral planes.

UTF-16 Surrogate Pairs

Characters outside the BMP, e.g. U+1D306 tetragram for centre ( ), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.

The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.

The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.

Mathias Bynens

Surrogate pair: A representation for a single abstract character that consists of a

sequence of two 16-bit code units, where the first value of the pair is a high-surrogate

code unit and the second value is a low-surrogate code unit. Surrogate pairs are used only in UTF-16.

Unicode 8.0 Chapter 3.9 - Surrogates (See Unicode Encoding)

Calculating Surrogate Pairs

The Unicode character :hankey: Pile of Poo (U+1F4A9) in UTF-16 must be encoded as a surrogate pair, i.e. two surrogates. To convert any code point to a surrogate pair, use the following algorithm (in JavaScript). Keep in mind that we're using hexidecimal notation.

var High_Surrogate = function(Code_Point){ return Math.floor((Code_Point - 0x10000) / 0x400) + 0xD800 };
 var Low_Surrogate  = function(Code_Point){ return (Code_Point - 0x10000) % 0x400 + 0xDC00 };

 // Reverses The Conversion
 var Code_Point = function(High_Surrogate, Low_Surrogate){
	return (High_Surrogate - 0xD800) * 0x400 + Low_Surrogate - 0xDC00 + 0x10000;
 };
> var codepoint = 0x1F4A9;   					// 0x1F4A9 == 128169
 > High_Surrogate(codepoint).toString(16)
 "d83d"  										// 0xD83D == 55357
 > Low_Surrogate(codepoint).toString(16)
 "dca9"  										// 0xDCA9 == 56489

 > String.fromCharCode(  High_Surrogate(codepoint) , Low_Surrogate(codepoint) );
  ":hankey:"
> String.fromCodePoint(0x1F4A9)
  ":hankey:"
 > '\ud83d\udca9'
  ":hankey:"

Composing & Decomposing

Unicode includes a mechanism for modifying character shape that greatly extends the supported glyph repertoire. This covers the use of combining diacritical marks. They are inserted after the main character. Multiple combining diacritics may be stacked over the same character. Unicode also contains precomposed versions of most letter/diacritic combinations in normal use.

Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposible character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ".

Precomposed characters may be decomposed for consistency or analysis. For example, in alphabetizing (collating) a list of names, the character "ü" may be decomposed into a "u" followed by the non-spacing character "¨". Once the character has been decomposed, it may be easier for the collation to work with the character because it can be processed as a "u" with modifications. This allows easier alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines the decompositions for all precomposed characters. It also defines normalization forms to provide for unique representations of characters.

Myths of Unicode

From Mark Davis's Unicode Myths slides.

  • Unicode is simply a 16-bit code- Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

  • You can use any unassigned codepoint for internal use- No. Eventually that hole will be filled with a different character. Instead use private use or noncharacters.

  • Every Unicode code point represents a character- No. There are lots of nonCharacters (FFFE, FFFF, 1FFFE,…)

    There are also surrogate code points, private and unassigned codepoints, and control/format “characters" (RLM, ZWNJ,…)

  • Unicode will run out of space- If it were linear, we would run out in 2140 AD. But it isn't linear. See https://www.unicode.org/roadmaps/

  • Case mappings are 1-1- No. They can also be:

    • One-to-many: (ß → SS )
    • Contextual: (…Σ ↔ …ς AND …ΣΤ… ↔ …στ… )
    • Locale-sensitive: ( I ↔ ı AND İ ↔ i )

Applied Unicode Encodings

Encoding Type Raw Encoding HTML Entity (Decimal) :vulcan_salute: HTML Entity (Hexadecimal) :vulcan_salute: URL Escape Code %F0%9F%96%96 UTF-8 (hex) 0xF0 0x9F 0x96 0x96 (f09f9696) UTF-8 (binary) 11110000:10011111:10010110:10010110 UTF-16/UTF-16BE (hex) 0xD83D 0xDD96 (d83ddd96) UTF-16LE (hex) 0x3DD8 0x96DD (3dd896dd) UTF-32/UTF-32BE (hex) 0x0001F596 (0001f596) UTF-32LE (hex) 0x96F50100 (96f50100) Octal Escape Sequence \360\237\226\226

Source Code

Encoding Type Raw Encoding JavaScript \u1F596 JSON \u1F596 C \u1F596 C++ \u1F596 Java \u1F596 Python \u1F596 Perl \x{1F596} Ruby \u{1F596} CSS \01F596

Awesome Characters List

JfuyErU.png!web

Special Characters

The Unicode Consortium published a general punctuation chart where you can find more details.

Char Name Description '' U+FEFF (Byte Order Mark - BOM) has the important property of unambiguity on byte reorder. It is also zerowidth, and invisible. In non-complying software (like the PHP interpreter) this leads to all sorts of fun behaviour. '￯' '\uFFEF' Reversed Byte Order Mark (BOM) does not equate to a legal character, other than the beginning of text. '​' '\u200B' zero-width non-break space (a character with no appearance and no effect other than preventing the formation of ligatures). ' ' U+00A0 NO-BREAK SPACE force adjacent characters to stick together. Well known as   in HTML. '­' U+00AD SOFT HYPHEN (in HTML: ­) like ZERO WIDTH SPACE, but show a hyphen if (and only if) a break occurs. '‍' U+200D ZERO WIDTH JOINER force adjacent characters to be joined together (e.g., arabic characters or supported emoji). Can be used this to compose sequentially combined emoji. '⁠' U+2060 WORD JOINER the same as U+00A0, but completely invisible. Good for writing @font-face on Twitter. ' ' U+1680 OGHAM SPACE MARK a space that looks like a dash. Great to bring programmers close to madness: 1 + 2 === 3. ';' U+037E GREEK QUESTION MARK a look-alike to the semicolon. Also a fun way to annoy developers. '‭' U+202D change the text direction to Left-to-Right. '‮' ‭ ‭ U+202E change the text direction to Right-to-Left: 'ꓸ' U+A4F8 LISU LETTER TONE MYA TI A lookalike for the period character. 'ꓹ' U+A4F9 LISU LETTER TONE NA PO A lookalike for the comma character. 'ꓼ' U+A4FC LISU LETTER TONE MYA NA A lookalike for the semi-colon character. 'ꓽ' U+A4FD LISU LETTER TONE MYA JEU A lookalike for the colon character. '︀' Variation Selectors ( U+FE00 to U+FE0F & U+E0100 to U+E01EF ) a block of 256 zero width characters that posess the ID_Continue proprerty- meaning they can be used in variable names (not the first letter). What makes these special is the fact that mouse cursors pass over them as they are combining characters - unlike most other zero width characters. 'ᅟ' U+115F HANGUL CHOSEONG FILLER In general it produces a space. Rendered as zero width (invisible) if not explicitly supported in rendering. Designated ID_Start 'ᅠ' U+1160 HANGUL JUNGSEONG FILLER Perhaps it produces a space? Rendered as zero width (invisible) if not explicitly supported in rendering. Designated ID_Start 'ㅤ' U+3164 HANGUL FILLER In general it produces a space. Rendered as zero width (invisible) if not explicitly supported in rendering. Designated ID_Start

Wait a second... what did I just read?

Variable identifiers can effectively include whitespace!

The U+3164 HANGUL FILLER character displays as an advancing whitespace character. The character is rendered as completely invisible (and non advancing, i.e. "zero width"), if not explicitly supported in rendering . That means the ugly character replacement (�) symbol should never be displayed.

I'm not yet sure why U+3164 was specified to behave this way. Interestingly, U+3164 was added to Unicode in version 1.1 (1993)- so the consortium must have had a lot of time to think it through. Anyway, here are a few examples.

> var ᅟ = 'foo';
undefined
> ᅟ
'foo'


> var ㅤ= alert;
undefined
> var foo = 'bar'
undefined
> if ( foo ===ㅤ`baz` ){} 	// alert
undefined


> var varㅤfooㅤ\u{A60C}ㅤπ = 'bar';
undefined
> varㅤfooㅤꘌㅤπ
'bar'

**NOTE:** I've tested U+3164 rendering on Ubuntu and OS X with the following: `node`, `php`, `ruby`, `python3.5`, `scala` ,`vim`, `cat`, `chrome`+`github gist`. Atom is the only system that fails by (incorrectly) displaying empty boxes. I have yet to test it out on Emacs and Sublime. From what I understand, the Unicode Consortium will not reassign or rename characters or codepoints, but may be convinced to change character properties like ID_Start/ID_Continue.

Modifiers

The zero-width joiner (ZWJ) is a non-printing character used in the computerized typesetting of some complex scripts such as the Arabic script or any Indic script. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.

The zero-width non-joiner (ZWNJ) is a non-printing character used in the computerization of writing systems that make use of ligatures. When placed between two characters that would otherwise be connected into a ligature, a ZWNJ causes them to be printed in their final and initial forms, respectively. This is also an effect of a space character, but a ZWNJ is used when it is desirable to keep the words closer together or to connect a word with its morpheme.

> 'a'
 "a"

> 'a\u{0308}'
 "ä"

> 'a\u{20DE}\u{0308}'
 "a⃞̈"

> 'a\u{20DE}\u{0308}\u{20DD}'
 "a⃞̈⃝"

// Modifying Invisible Characters
> '\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}'
 "‎‎‎‎‎‎‎‎‎‎"

> '\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}'.length
 10

Uppercase Transformation Collisions

Char Code Point Output Char ß 0x00DF SS ı 0x0131 I ſ 0x017F S ff 0xFB00 FF fi 0xFB01 FI fl 0xFB02 FL ffi 0xFB03 FFI ffl 0xFB04 FFL ſt 0xFB05 ST st 0xFB06 ST

Lowercase Transformation Collisions

Char Code Point Output Char K 0x212A k

Quirks and Troubleshooting

  • String length is typically determined by counting codepoints.This means that surrogate pairs would count as two characters. Combining multiple diacritics may be stacked over the same character. a + ̈ == ̈a , increasing length, while only producing a single character.

  • Similarily, reversing strings often is a non-trivial task.Again, surrogate pairs and diacritics must be reversed together. ES Reverser provides a pretty good solution.

  • Upper and lower case mappings are not always one-to-one.They can also be:

    • One-to-many: (ß → SS )
    • Contextual: (…Σ ↔ …ς AND …ΣΤ… ↔ …στ… )
    • Locale-sensitive: ( I ↔ ı AND İ ↔ i )

One-To-Many Case Mappings

Most of the below characters express their one-to-many case mappings when uppercased- while others should be lowercased. This list should be split up

Code Point Character Name Mapped Character Mapped Code Points U+00DF ß LATIN SMALL LETTER SHARP S s , s U+0073, U+0073 U+0130 İ LATIN CAPITAL LETTER I WITH DOT ABOVE i , ̇ U+0069, U+0307 U+0149 ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE ʼ , n U+02BC, U+006E U+01F0 ǰ LATIN SMALL LETTER J WITH CARON j , ̌ U+006A, U+030C U+0390 ΐ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS ι , ̈ , ́ U+03B9, U+0308, U+0301 U+03B0 ΰ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS υ , ̈ , ́ U+03C5, U+0308, U+0301 U+0587 և ARMENIAN SMALL LIGATURE ECH YIWN ե , ւ U+0565, U+0582 U+1E96 LATIN SMALL LETTER H WITH LINE BELOW h , ̱ U+0068, U+0331 U+1E97 LATIN SMALL LETTER T WITH DIAERESIS t , ̈ U+0074, U+0308 U+1E98 LATIN SMALL LETTER W WITH RING ABOVE w , ̊ U+0077, U+030A U+1E99 LATIN SMALL LETTER Y WITH RING ABOVE y , ̊ U+0079, U+030A U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING a , ʾ U+0061, U+02BE U+1E9E LATIN CAPITAL LETTER SHARP S s , s U+0073, U+0073 U+1F50 GREEK SMALL LETTER UPSILON WITH PSILI υ , ̓ U+03C5, U+0313 U+1F52 GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA υ , ̓ , ̀ U+03C5, U+0313, U+0300 U+1F54 GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA υ , ̓ , ́ U+03C5, U+0313, U+0301 U+1F56 GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI υ , ̓ , ͂ U+03C5, U+0313, U+0342 U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI , ι U+1F00, U+03B9 U+1F81 GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI , ι U+1F01, U+03B9 U+1F82 GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI , ι U+1F02, U+03B9 U+1F83 GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI , ι U+1F03, U+03B9 U+1F84 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI , ι U+1F04, U+03B9 U+1F85 GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI , ι U+1F05, U+03B9 U+1F86 GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI , ι U+1F06, U+03B9 U+1F87 GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI , ι U+1F07, U+03B9 U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI , ι U+1F00, U+03B9 U+1F89 GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI , ι U+1F01, U+03B9 U+1F8A GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI , ι U+1F02, U+03B9 U+1F8B GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI , ι U+1F03, U+03B9 U+1F8C GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI , ι U+1F04, U+03B9 U+1F8D GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI , ι U+1F05, U+03B9 U+1F8E GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI , ι U+1F06, U+03B9 U+1F8F GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI , ι U+1F07, U+03B9 U+1F90 GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI , ι U+1F20, U+03B9 U+1F91 GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI , ι U+1F21, U+03B9 U+1F92 GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI , ι U+1F22, U+03B9 U+1F93 GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI , ι U+1F23, U+03B9 U+1F94 GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI , ι U+1F24, U+03B9 U+1F95 GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI , ι U+1F25, U+03B9 U+1F96 GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI , ι U+1F26, U+03B9 U+1F97 GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI , ι U+1F27, U+03B9 U+1F98 GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI , ι U+1F20, U+03B9 U+1F99 GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI , ι U+1F21, U+03B9 U+1F9A GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI , ι U+1F22, U+03B9 U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI , ι U+1F23, U+03B9 U+1F9C GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI , ι U+1F24, U+03B9 U+1F9D GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI , ι U+1F25, U+03B9 U+1F9E GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI , ι U+1F26, U+03B9 U+1F9F GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI , ι U+1F27, U+03B9 U+1FA0 GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI , ι U+1F60, U+03B9 U+1FA1 GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI , ι U+1F61, U+03B9 U+1FA2 GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI , ι U+1F62, U+03B9 U+1FA3 GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI , ι U+1F63, U+03B9 U+1FA4 GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI , ι U+1F64, U+03B9 U+1FA5 GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI , ι U+1F65, U+03B9 U+1FA6 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI , ι U+1F66, U+03B9 U+1FA7 GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI , ι U+1F67, U+03B9 U+1FA8 GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI , ι U+1F60, U+03B9 U+1FA9 GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI , ι U+1F61, U+03B9 U+1FAA GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI , ι U+1F62, U+03B9 U+1FAB GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI , ι U+1F63, U+03B9 U+1FAC GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI , ι U+1F64, U+03B9 U+1FAD GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI , ι U+1F65, U+03B9 U+1FAE GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI , ι U+1F66, U+03B9 U+1FAF GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI , ι U+1F67, U+03B9 U+1FB2 GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI , ι U+1F70, U+03B9 U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI α , ι U+03B1, U+03B9 U+1FB4 GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI ά , ι U+03AC, U+03B9 U+1FB6 GREEK SMALL LETTER ALPHA WITH PERISPOMENI α , ͂ U+03B1, U+0342 U+1FB7 GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI α , ͂ , ι U+03B1, U+0342, U+03B9 U+1FBC GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI α , ι U+03B1, U+03B9 U+1FC2 GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI , ι U+1F74, U+03B9 U+1FC3 GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI η , ι U+03B7, U+03B9 U+1FC4 GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI ή , ι U+03AE, U+03B9 U+1FC6 GREEK SMALL LETTER ETA WITH PERISPOMENI η , ͂ U+03B7, U+0342 U+1FC7 GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI η , ͂ , ι U+03B7, U+0342, U+03B9 U+1FCC GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI η , ι U+03B7, U+03B9 U+1FD2 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA ι , ̈ , ̀ U+03B9, U+0308, U+0300 U+1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA ι , ̈ , ́ U+03B9, U+0308, U+0301 U+1FD6 GREEK SMALL LETTER IOTA WITH PERISPOMENI ι , ͂ U+03B9, U+0342 U+1FD7 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI ι , ̈ , ͂ U+03B9, U+0308, U+0342 U+1FE2 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA υ , ̈ , ̀ U+03C5, U+0308, U+0300 U+1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA υ , ̈ , ́ U+03C5, U+0308, U+0301 U+1FE4 GREEK SMALL LETTER RHO WITH PSILI ρ , ̓ U+03C1, U+0313 U+1FE6 GREEK SMALL LETTER UPSILON WITH PERISPOMENI υ , ͂ U+03C5, U+0342 U+1FE7 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI υ , ̈ , ͂ U+03C5, U+0308, U+0342 U+1FF2 GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI , ι U+1F7C, U+03B9 U+1FF3 GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI ω , ι U+03C9, U+03B9 U+1FF4 GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI ώ , ι U+03CE, U+03B9 U+1FF6 GREEK SMALL LETTER OMEGA WITH PERISPOMENI ω , ͂ U+03C9, U+0342 U+1FF7 GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI ω , ͂ , ι U+03C9, U+0342, U+03B9 U+1FFC GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI ω , ι U+03C9, U+03B9 U+FB00 LATIN SMALL LIGATURE FF f , f U+0066, U+0066 U+FB01 LATIN SMALL LIGATURE FI f , i U+0066, U+0069 U+FB02 LATIN SMALL LIGATURE FL f , l U+0066, U+006C U+FB03 LATIN SMALL LIGATURE FFI f , f , i U+0066, U+0066, U+0069 U+FB04 LATIN SMALL LIGATURE FFL f , f , l U+0066, U+0066, U+006C U+FB05 LATIN SMALL LIGATURE LONG S T s , t U+0073, U+0074 U+FB06 LATIN SMALL LIGATURE ST s , t U+0073, U+0074 U+FB13 ARMENIAN SMALL LIGATURE MEN NOW մ , ն U+0574, U+0576 U+FB14 ARMENIAN SMALL LIGATURE MEN ECH մ , ե U+0574, U+0565 U+FB15 ARMENIAN SMALL LIGATURE MEN INI մ , ի U+0574, U+056B U+FB16 ARMENIAN SMALL LIGATURE VEW NOW վ , ն U+057E, U+0576 U+FB17 ARMENIAN SMALL LIGATURE MEN XEH մ , խ U+0574, U+056D

Awesome Packages & Libraries

  • PhantomScript - :ghost: :flashlight: Invisible JavaScript code execution & social engineering
  • ESReverser - A Unicode-aware string reverser written in JavaScript.
  • mimic - [ab]using Unicode to create tragedy
  • python-ftfy - Given Unicode text, make its representation consistent and possibly less broken.
  • vim-troll-stopper - Stop Unicode trolls from messing with your code.

Emojis

Diversity

The Unicode Consortium has made a huge effort better reflect and incorporate human diversity, including cultural practices. Here is the Consortium's diversity report .

Emojis of mixed gender situations are now available, such as same sex families, holding hands, and kissing. The real kicker are Emoji combined sequences . Basically:

Code Points Recipe Combined U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469 nIfYZzu.png!webyMzYJvr.png!webIVryUrM.png!webyMzYJvr.png!webnIfYZzu.png!webapple_1f469_200d_2764_fe0f_200d_1f469.png U+1F468 U+200D U+1F468 U+200D U+1F467 U+200D U+1F466 eAJzeaE.jpg!webJvyM7bi.png!web

Further, emojis now support skin color modifiers.

Five symbol modifier characters that provide for a range of skin tones for human emoji were released in Unicode Version 8.0 (mid-2015). These characters are based on the six tones of the Fitzpatrick scale, a recognized standard for dermatology (there are many examples of this scale online, such as FitzpatrickSkinType.pdf). The exact shades may vary between implementations.

Unicode Consortium's Diversity report

Code Name Samples U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2 j2eM7fu.png!webu2muMfm.png!web U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 RnEjuyj.png!web6BZBnaB.png!web U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 aIrQjuB.png!webvqIza26.png!web U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 YrMfay6.png!web26B7v2Q.png!web U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6 MzQN3qz.png!webyIri22b.png!web

Just follow the desired Emoji with one of the skin color modifiers \u{1F466}\u{1F3FE} .

7JVRjq7.png!webMzmeYbR.png!web  →  nqI3Enm.png!web

ZzQfYfe.png!web

Creatively Naming Variables and Methods

Examples are written in JavaScript (ES6)

In general, characters designated the ID_START property may be used at the beggining of a variable name. Characters designated with the ID_CONTINUE property may be used after the first character of a variable.

function rand(μ,σ){ ... };

String.prototype.reverseⵑ = function(){..};

Number.prototype.isTrueɁ = function(){..};

var WhatDoesThisDoɁɁɁɁ = 42

Here are some really creative variable names from Mathias Bynes

// How convenient!
var π = Math.PI;

// Sometimes, you just have to use the Bad Parts of JavaScript:
var ಠ_ಠ = eval;

// Code, Y U NO WORK?!
var ლ_ಠ益ಠ_ლ = 42;

// How about a JavaScript library for functional programming?
var λ = function() {};

// Obfuscate boring variable names for great justice
var \u006C\u006F\u006C\u0077\u0061\u0074 = 'heh';

// …or just make up random ones
var Ꙭൽↈⴱ = 'huh';

// While perfectly valid, this doesn’t work in most browsers:
var foo\u200Cbar = 42;

// This is *not* a bitwise left shift (`<<`):
var 〱〱 = 2;
// This is, though:
〱〱 << 〱〱; // 8

// Give yourself a discount:
var price_9̶9̶_89 = 'cheap';

// Fun with Roman numerals
var Ⅳ = 4;
var Ⅴ = 5;
Ⅳ + Ⅴ; // 9

// Cthulhu was here
var Hͫ̆̒̐ͣ̊̄ͯ͗͏̵̗̻̰̠̬͝ͅE̴̷̬͎̱̘͇͍̾ͦ͊͒͊̓̓̐_̫̠̱̩̭̤͈̑̎̋ͮͩ̒͑̾͋͘Ç̳͕̯̭̱̲̣̠̜͋̍O̴̦̗̯̹̼ͭ̐ͨ̊̈͘͠M̶̝̠̭̭̤̻͓͑̓̊ͣͤ̎͟͠E̢̞̮̹͍̞̳̣ͣͪ͐̈T̡̯̳̭̜̠͕͌̈́̽̿ͤ̿̅̑Ḧ̱̱̺̰̳̹̘̰́̏ͪ̂̽͂̀͠ = 'Zalgo';

And here's some Unicode CSS Classes from David Walsh

<!-- place this within the document head -->
<meta charset="UTF-8" />

<!-- error message -->
<div class="ಠ_ಠ">You do not have access to this page.</div>

<!-- success message -->
<div class="❤">Your changes have been saved successfully!</div>
.ಠ_ಠ {
	border: 1px solid #f00;
}

.❤ {
	background: lightgreen;
}

Recursive HTML Tag Renaming Script

If you want to rename all your HTML tags to what appears as nothing, the following script is just what your looking for.

Do note however that HTML does not support all unicode characters.

// U+1160 HANGUL JUNGSEONG FILLER
transformAllTags('ᅠ');

// An actual HTML element node designed to look like a comment node, using the U+01C3 LATIN LETTER RETROFLEX CLICK 
//	<ǃ-- name="viewport" content="width=device-width"></ǃ-->
transformAllTags('ǃ--');

// or even <ᅠ⃝
transformAllTags('\u{1160}\u{20dd}');

// and for a bonus, all existing tag names will have each character ensquared. h⃞t⃞m⃞l⃞
transformAllTags();


function transformAllTags (newName){
   // querySelectorAll doesn't actually return an array.
   Array.from(document.querySelectorAll('*'))
     .forEach(function(x){
         transformTag(x, newName);
   });
}

function wonky(str){
  return str.split('').join('\u{20de}') + '\u{20de}';
}

function transformTag(tagIdOrElem, tagType){
    var elem = (tagIdOrElem instanceof HTMLElement) ? tagIdOrElem : document.getElementById(tagIdOrElem);
    if(!elem || !(elem instanceof HTMLElement))return;
    var children = elem.childNodes;
    var parent = elem.parentNode;
    var newNode = document.createElement(tagType||wonky(elem.tagName));
    for(var a=0;a<elem.attributes.length;a++){
        newNode.setAttribute(elem.attributes[a].nodeName, elem.attributes[a].value);
    }
    for(var i= 0,clen=children.length;i<clen;i++){
        newNode.appendChild(children[0]); //0...always point to the first non-moved element
    }
    newNode.style.cssText = elem.style.cssText;
    parent.replaceChild(newNode,elem);
}

Here is what it does support:

function testBegin(str){
 try{
    eval(`document.createElement( '${str}' );`)
    return true;
 }
 catch(e){ return false; }
}

function testContinue(str){
 try{
    eval(`document.createElement( 'a${str}' );`)
    return true;
 }
 catch(e){ return false; }
}

And heres some basic results

// Test if dashes can start an HTML Tag
> testBegin('-')
< false

> testContinue('-')
< true

> testBegin('ᅠ-')	// Prepend dash with U+1160 HANGUL JUNGSEONG FILLER
< true

Unicode Fonts

A single TrueType / OpenType font format cannot cover all UTF-8 characters as there is a hard limit of 65535 glyphs in a font. Since there are over 1.1 million UTF-8 glphys, you will need to use a font-family to cover them all.

More Reading

Exploring Deeper into Unicode Yourself

Overview Map

A map of the Basic Multilingual Plane

Each numbered box represents 256 code points.

UzY36nN.png!web

The Chinese, Japanese and Korean (CJK) scripts share a common background, collectively known as CJK characters. In the process called Han unification, the common (shared) characters were identified and named "CJK Unified Ideographs".

Unicode Blocks

The Unicode standard arranges groups of characters together in blocks. This is the complete list of blocks across all 17 planes.

Name From To # Codepoints Basic Latin U+0000 U+007F 128 Latin-1 Supplement U+0080 U+00FF 128 Latin Extended-A U+0100 U+017F 128 Latin Extended-B U+0180 U+024F 208 IPA Extensions U+0250 U+02AF 96 Spacing Modifier Letters U+02B0 U+02FF 80 Combining Diacritical Marks U+0300 U+036F 112 Greek and Coptic U+0370 U+03FF 135 Cyrillic U+0400 U+04FF 256 Cyrillic Supplement U+0500 U+052F 48 Armenian U+0530 U+058F 89 Hebrew U+0590 U+05FF 87 Arabic U+0600 U+06FF 255 Syriac U+0700 U+074F 77 Arabic Supplement U+0750 U+077F 48 Thaana U+0780 U+07BF 50 NKo U+07C0 U+07FF 59 Samaritan U+0800 U+083F 61 Mandaic U+0840 U+085F 29 Arabic Extended-A U+08A0 U+08FF 50 Devanagari U+0900 U+097F 128 Bengali U+0980 U+09FF 93 Gurmukhi U+0A00 U+0A7F 79 Gujarati U+0A80 U+0AFF 85 Oriya U+0B00 U+0B7F 90 Tamil U+0B80 U+0BFF 72 Telugu U+0C00 U+0C7F 96 Kannada U+0C80 U+0CFF 87 Malayalam U+0D00 U+0D7F 100 Sinhala U+0D80 U+0DFF 90 Thai U+0E00 U+0E7F 87 Lao U+0E80 U+0EFF 67 Tibetan U+0F00 U+0FFF 211 Myanmar U+1000 U+109F 160 Georgian U+10A0 U+10FF 88 Hangul Jamo U+1100 U+11FF 256 Ethiopic U+1200 U+137F 358 Ethiopic Supplement U+1380 U+139F 26 Cherokee U+13A0 U+13FF 92 Unified Canadian Aboriginal Syllabics U+1400 U+167F 640 Ogham U+1680 U+169F 29 Runic U+16A0 U+16FF 89 Tagalog U+1700 U+171F 20 Hanunoo U+1720 U+173F 23 Buhid U+1740 U+175F 20 Tagbanwa U+1760 U+177F 18 Khmer U+1780 U+17FF 114 Mongolian U+1800 U+18AF 156 Unified Canadian Aboriginal Syllabics Extended U+18B0 U+18FF 70 Limbu U+1900 U+194F 68 Tai Le U+1950 U+197F 35 New Tai Lue U+1980 U+19DF 83 Khmer Symbols U+19E0 U+19FF 32 Buginese U+1A00 U+1A1F 30 Tai Tham U+1A20 U+1AAF 127 Combining Diacritical Marks Extended U+1AB0 U+1AFF 15 Balinese U+1B00 U+1B7F 121 Sundanese U+1B80 U+1BBF 64 Batak U+1BC0 U+1BFF 56 Lepcha U+1C00 U+1C4F 74 Ol Chiki U+1C50 U+1C7F 48 Sundanese Supplement U+1CC0 U+1CCF 8 Vedic Extensions U+1CD0 U+1CFF 41 Phonetic Extensions U+1D00 U+1D7F 128 Phonetic Extensions Supplement U+1D80 U+1DBF 64 Combining Diacritical Marks Supplement U+1DC0 U+1DFF 58 Latin Extended Additional U+1E00 U+1EFF 256 Greek Extended U+1F00 U+1FFF 233 General Punctuation U+2000 U+206F 111 Superscripts and Subscripts U+2070 U+209F 42 Currency Symbols U+20A0 U+20CF 31 Combining Diacritical Marks for Symbols U+20D0 U+20FF 33 Letterlike Symbols U+2100 U+214F 80 Number Forms U+2150 U+218F 60 Arrows U+2190 U+21FF 112 Mathematical Operators U+2200 U+22FF 256 Miscellaneous Technical U+2300 U+23FF 251 Control Pictures U+2400 U+243F 39 Optical Character Recognition U+2440 U+245F 11 Enclosed Alphanumerics U+2460 U+24FF 160 Box Drawing U+2500 U+257F 128 Block Elements U+2580 U+259F 32 Geometric Shapes U+25A0 U+25FF 96 Miscellaneous Symbols U+2600 U+26FF 256 Dingbats U+2700 U+27BF 192 Miscellaneous Mathematical Symbols-A U+27C0 U+27EF 48 Supplemental Arrows-A U+27F0 U+27FF 16 Braille Patterns U+2800 U+28FF 256 Supplemental Arrows-B U+2900 U+297F 128 Miscellaneous Mathematical Symbols-B U+2980 U+29FF 128 Supplemental Mathematical Operators U+2A00 U+2AFF 256 Miscellaneous Symbols and Arrows U+2B00 U+2BFF 206 Glagolitic U+2C00 U+2C5F 94 Latin Extended-C U+2C60 U+2C7F 32 Coptic U+2C80 U+2CFF 123 Georgian Supplement U+2D00 U+2D2F 40 Tifinagh U+2D30 U+2D7F 59 Ethiopic Extended U+2D80 U+2DDF 79 Cyrillic Extended-A U+2DE0 U+2DFF 32 Supplemental Punctuation U+2E00 U+2E7F 67 CJK Radicals Supplement U+2E80 U+2EFF 115 Kangxi Radicals U+2F00 U+2FDF 214 Ideographic Description Characters U+2FF0 U+2FFF 12 CJK Symbols and Punctuation U+3000 U+303F 64 Hiragana U+3040 U+309F 93 Katakana U+30A0 U+30FF 96 Bopomofo U+3100 U+312F 41 Hangul Compatibility Jamo U+3130 U+318F 94 Kanbun U+3190 U+319F 16 Bopomofo Extended U+31A0 U+31BF 27 CJK Strokes U+31C0 U+31EF 36 Katakana Phonetic Extensions U+31F0 U+31FF 16 Enclosed CJK Letters and Months U+3200 U+32FF 254 CJK Compatibility U+3300 U+33FF 256 CJK Unified Ideographs Extension A U+3400 U+4DBF 6191 Yijing Hexagram Symbols U+4DC0 U+4DFF 64 CJK Unified Ideographs U+4E00 U+9FFF 20941 Yi Syllables U+A000 U+A48F 1165 Yi Radicals U+A490 U+A4CF 55 Lisu U+A4D0 U+A4FF 48 Vai U+A500 U+A63F 300 Cyrillic Extended-B U+A640 U+A69F 96 Bamum U+A6A0 U+A6FF 88 Modifier Tone Letters U+A700 U+A71F 32 Latin Extended-D U+A720 U+A7FF 159 Syloti Nagri U+A800 U+A82F 44 Common Indic Number Forms U+A830 U+A83F 10 Phags-pa U+A840 U+A87F 56 Saurashtra U+A880 U+A8DF 81 Devanagari Extended U+A8E0 U+A8FF 30 Kayah Li U+A900 U+A92F 48 Rejang U+A930 U+A95F 37 Hangul Jamo Extended-A U+A960 U+A97F 29 Javanese U+A980 U+A9DF 91 Myanmar Extended-B U+A9E0 U+A9FF 31 Cham U+AA00 U+AA5F 83 Myanmar Extended-A U+AA60 U+AA7F 32 Tai Viet U+AA80 U+AADF 72 Meetei Mayek Extensions U+AAE0 U+AAFF 23 Ethiopic Extended-A U+AB00 U+AB2F 32 Latin Extended-E U+AB30 U+AB6F 54 Cherokee Supplement U+AB70 U+ABBF 80 Meetei Mayek U+ABC0 U+ABFF 56 Hangul Syllables U+AC00 U+D7AF 2 Hangul Jamo Extended-B U+D7B0 U+D7FF 72 High Surrogates U+D800 U+DB7F 2 High Private Use Surrogates U+DB80 U+DBFF 2 Low Surrogates U+DC00 U+DFFF 2 Private Use Area U+E000 U+F8FF 2 CJK Compatibility Ideographs U+F900 U+FAFF 472 Alphabetic Presentation Forms U+FB00 U+FB4F 58 Arabic Presentation Forms-A U+FB50 U+FDFF 643 Variation Selectors U+FE00 U+FE0F 16 Vertical Forms U+FE10 U+FE1F 10 Combining Half Marks U+FE20 U+FE2F 16 CJK Compatibility Forms U+FE30 U+FE4F 32 Small Form Variants U+FE50 U+FE6F 26 Arabic Presentation Forms-B U+FE70 U+FEFF 141 Halfwidth and Fullwidth Forms U+FF00 U+FFEF 225 Specials U+FFF0 U+FFFF 7 Linear B Syllabary U+10000 U+1007F 88 Linear B Ideograms U+10080 U+100FF 123 Aegean Numbers U+10100 U+1013F 57 Ancient Greek Numbers U+10140 U+1018F 77 Ancient Symbols U+10190 U+101CF 13 Phaistos Disc U+101D0 U+101FF 46 Lycian U+10280 U+1029F 29 Carian U+102A0 U+102DF 49 Coptic Epact Numbers U+102E0 U+102FF 28 Old Italic U+10300 U+1032F 36 Gothic U+10330 U+1034F 27 Old Permic U+10350 U+1037F 43 Ugaritic U+10380 U+1039F 31 Old Persian U+103A0 U+103DF 50 Deseret U+10400 U+1044F 80 Shavian U+10450 U+1047F 48 Osmanya U+10480 U+104AF 40 Elbasan U+10500 U+1052F 40 Caucasian Albanian U+10530 U+1056F 53 Linear A U+10600 U+1077F 341 Cypriot Syllabary U+10800 U+1083F 55 Imperial Aramaic U+10840 U+1085F 31 Palmyrene U+10860 U+1087F 32 Nabataean U+10880 U+108AF 40 Hatran U+108E0 U+108FF 26 Phoenician U+10900 U+1091F 29 Lydian U+10920 U+1093F 27 Meroitic Hieroglyphs U+10980 U+1099F 32 Meroitic Cursive U+109A0 U+109FF 90 Kharoshthi U+10A00 U+10A5F 65 Old South Arabian U+10A60 U+10A7F 32 Old North Arabian U+10A80 U+10A9F 32 Manichaean U+10AC0 U+10AFF 51 Avestan U+10B00 U+10B3F 61 Inscriptional Parthian U+10B40 U+10B5F 30 Inscriptional Pahlavi U+10B60 U+10B7F 27 Psalter Pahlavi U+10B80 U+10BAF 29 Old Turkic U+10C00 U+10C4F 73 Old Hungarian U+10C80 U+10CFF 108 Rumi Numeral Symbols U+10E60 U+10E7F 31 Brahmi U+11000 U+1107F 109 Kaithi U+11080 U+110CF 66 Sora Sompeng U+110D0 U+110FF 35 Chakma U+11100 U+1114F 67 Mahajani U+11150 U+1117F 39 Sharada U+11180 U+111DF 94 Sinhala Archaic Numbers U+111E0 U+111FF 20 Khojki U+11200 U+1124F 61 Multani U+11280 U+112AF 38 Khudawadi U+112B0 U+112FF 69 Grantha U+11300 U+1137F 85 Tirhuta U+11480 U+114DF 82 Siddham U+11580 U+115FF 92 Modi U+11600 U+1165F 79 Takri U+11680 U+116CF 66 Ahom U+11700 U+1173F 57 Warang Citi U+118A0 U+118FF 84 Pau Cin Hau U+11AC0 U+11AFF 57 Cuneiform U+12000 U+123FF 922 Cuneiform Numbers and Punctuation U+12400 U+1247F 116 Early Dynastic Cuneiform U+12480 U+1254F 196 Egyptian Hieroglyphs U+13000 U+1342F 1071 Anatolian Hieroglyphs U+14400 U+1467F 583 Bamum Supplement U+16800 U+16A3F 569 Mro U+16A40 U+16A6F 43 Bassa Vah U+16AD0 U+16AFF 36 Pahawh Hmong U+16B00 U+16B8F 127 Miao U+16F00 U+16F9F 133 Kana Supplement U+1B000 U+1B0FF 2 Duployan U+1BC00 U+1BC9F 143 Shorthand Format Controls U+1BCA0 U+1BCAF 4 Byzantine Musical Symbols U+1D000 U+1D0FF 246 Musical Symbols U+1D100 U+1D1FF 231 Ancient Greek Musical Notation U+1D200 U+1D24F 70 Tai Xuan Jing Symbols U+1D300 U+1D35F 87 Counting Rod Numerals U+1D360 U+1D37F 18 Mathematical Alphanumeric Symbols U+1D400 U+1D7FF 996 Sutton SignWriting U+1D800 U+1DAAF 672 Mende Kikakui U+1E800 U+1E8DF 213 Arabic Mathematical Alphabetic Symbols U+1EE00 U+1EEFF 143 Mahjong Tiles U+1F000 U+1F02F 44 Domino Tiles U+1F030 U+1F09F 100 Playing Cards U+1F0A0 U+1F0FF 82 Enclosed Alphanumeric Supplement U+1F100 U+1F1FF 173 Enclosed Ideographic Supplement U+1F200 U+1F2FF 57 Miscellaneous Symbols and Pictographs U+1F300 U+1F5FF 766 Emoticons U+1F600 U+1F64F 80 Ornamental Dingbats U+1F650 U+1F67F 48 Transport and Map Symbols U+1F680 U+1F6FF 98 Alchemical Symbols U+1F700 U+1F77F 116 Geometric Shapes Extended U+1F780 U+1F7FF 85 Supplemental Arrows-C U+1F800 U+1F8FF 148 Supplemental Symbols and Pictographs U+1F900 U+1F9FF 15 CJK Unified Ideographs Extension B U+20000 U+2A6DF 42676 CJK Unified Ideographs Extension C U+2A700 U+2B73F 60 CJK Unified Ideographs Extension D U+2B740 U+2B81F 27 CJK Unified Ideographs Extension E U+2B820 U+2CEAF 2 CJK Compatibility Ideographs Supplement U+2F800 U+2FA1F 542 Tags U+E0000 U+E007F 97 Variation Selectors Supplement U+E0100 U+E01EF 240 Supplementary Private Use Area-A U+F0000 U+FFFFF 4 Supplementary Private Use Area-B U+100000 U+10FFFF 4

Principles of the Unicode Standard

The Unicode Standard set forth the following fundamental principles :

  • Universal repertoire - Every writing system ever used shall be respected and represented in the standard
  • Logical order - In bidirectional text are the characters stored in logical order, not in a way that the representaion
  • Efficiency - The documentation must be efficient and complete.
  • Unification - Where different cultures or languages use the same character, it shall be only included once. This point is
  • Characters, not glyphs - Only characters, not glyphs shall be encoded. In a nutshell, glyphs are the actual graphical
  • Dynamic composition - New characters can be composed of other, already standardized characters. For example, the character “Ä” can be composed of an “A” and a dieresis sign (“ ¨ ”).
  • Semantics - Included characters must be well defined and distinguished from others.
  • Stability - Once defined characters shall never be removed or their codepoints reassigned. In the case of an error, a codepoint shall be deprecated.
  • Plain Text - Characters in the standard are text and never mark-up or metacharacters.
  • Convertibility - Every other used encoding shall be representable in terms of a Unicode encoding.

Source: Principle descriptions are from codepoints.net

Unicode Versions


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK