Still Using charCodeAt? You're Behind the Times

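When handling Chinese and other Unicode characters in JavaScript, we rely on the language's Unicode-related APIs.

In the early days, JavaScript provided String.prototype.charCodeAt and String.fromCharCode to convert a string into its UTF-16 code units and to build a string back from UTF-16 code units.

For example: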

const str = '中文';

console.log([...str].map(char => char.charCodeAt(0)));
// [20013, 25991]

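Here we spread the string into individual characters and call charCodeAt on each one to get its Unicode value. 20013 and 25991 are the Unicode code points of the two characters in “中文”.

Likewise, we can use fromCharCode to convert these values back into a string: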

const charCodes = [20013, 25991];

console.log(String.fromCharCode(...charCodes)); // 中文

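Most readers will be familiar with these two methods, which have been part of the language since its earliest editions. Today, however, they are no longer enough for handling Unicode characters.

Why? Let's look at an example: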

const str = '🀄';

console.log(str.charCodeAt(0)); // 55356

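This character is the familiar red dragon tile (红中) from mahjong. Many input methods can type it directly these days, and the result looks normal enough. Nothing wrong, right?

But now try: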

console.log(String.fromCharCode(55356)); // �

In fact, the UTF-16 encoding of the Unicode character 🀄 is not just 55356. If you use charCodeAt to get its UTF-16 encoding, you should get two values:

const str = '🀄';

console.log(str.charCodeAt(0), str.charCodeAt(1)); // 55356 56324

Correspondingly, only String.fromCharCode(55356, 56324) restores the 🀄 character. There are other surprises as well, for example:

console.log('🀄'.length); // 2: the length counts two code units
'🀄'.split(''); // ["�", "�"]: split yields two broken halves
/^.$/.test('🀄'); // false


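👉 Key point: The Unicode standard groups code points into planes of 2**16 code points each. Code points run from 0x000000 to 0x10FFFF, which makes 17 planes, and every code point fits in 3 bytes (21 bits, to be precise).

The highest byte is the plane number, from 0x0 to 0x10, 17 planes in total.

Plane 0 is called the Basic Multilingual Plane (BMP). Every code point in this plane fits in a single 16-bit code unit, so in UTF-16 its code unit value equals its code point.

The other planes are called supplementary planes, and their characters are called supplementary characters. Their code points all exceed the 16-bit range.

The Unicode-related APIs of ES5 and earlier only handle BMP characters correctly, because every string operation works on 16-bit UTF-16 code units.

As a result, when a supplementary character such as 🀄 appears, the results do not match expectations.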
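As a small illustration of the plane layout, the plane number can be read off by dropping the low 16 bits of a code point. This is only a sketch; getPlane is a hypothetical helper name, not a built-in API.

```javascript
// Hypothetical helper: returns the plane number (0 to 16) of a code point.
function getPlane(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Invalid code point: ' + codePoint);
  }
  // Each plane holds 2 ** 16 code points, so the plane is whatever
  // remains above the low 16 bits.
  return codePoint >> 16;
}

console.log(getPlane(0x4E2D));   // 0: '中' is in the BMP
console.log(getPlane(0x1F004));  // 1: the red dragon tile is a supplementary character
console.log(getPlane(0x10FFFF)); // 16: the last plane
```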

Since ES2015, JavaScript provides new APIs that work with Unicode code points, so we can write:

const str = '🀄';

console.log(str.codePointAt(0)); // 126980

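👉 Key point: String.prototype.codePointAt(index) returns the Unicode code point of the character that starts at position index, where index counts UTF-16 code units. Unlike the older charCodeAt, it handles supplementary characters properly.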
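One detail worth seeing in action: the index passed to codePointAt still counts code units, so an index that lands in the middle of a surrogate pair returns only the second half. The sketch below uses an arbitrary sample string and writes the tile as an escape so the code units are unambiguous.

```javascript
const text = 'a\u{1F004}b'; // 'a🀄b': four code units, three code points

console.log(text.codePointAt(0)); // 97, 'a'
console.log(text.codePointAt(1)); // 126980, the full code point of the tile
console.log(text.codePointAt(2)); // 56324, only the second half of the pair
console.log(text.codePointAt(3)); // 98, 'b'

// for...of iterates by code point, so it never lands inside a pair.
const codePoints = [];
for (const ch of text) {
  codePoints.push(ch.codePointAt(0));
}
console.log(codePoints); // [97, 126980, 98]
```

Iterating with for...of (or the spread operator) is usually the safer way to walk a string when supplementary characters may be present.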

Correspondingly, String.fromCodePoint converts code points back into characters:

console.log(String.fromCodePoint(126980)); // 🀄
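To tie the two ES2015 methods together, here is a minimal round trip that mirrors the earlier charCodeAt / fromCharCode example. The sample string is an arbitrary choice for illustration.

```javascript
const original = '中\u{1F004}文'; // '中🀄文'

// Array.from iterates the string by code point, like the spread operator.
const points = Array.from(original, ch => ch.codePointAt(0));
console.log(points); // [20013, 126980, 25991]

const restored = String.fromCodePoint(...points);
console.log(restored === original); // true

// The older API sees the tile as two separate code units instead.
const units = original.split('').map(ch => ch.charCodeAt(0));
console.log(units); // [20013, 55356, 56324, 25991]
```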

Unicode escapes

JavaScript strings support Unicode escapes, so we can represent a character by the hexadecimal form of its code point with a \u prefix, for example:

console.log('\u4e2d\u6587'); // 中文

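0x4e2d and 0x6587 are the hexadecimal forms of 20013 and 25991.

Note that Unicode escapes are not limited to strings. \uxxxx can also be used in identifiers, and the escaped and literal forms are interchangeable. For example, we can write: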

const \u4e2d\u6587 = '测试';

console.log(中文); // 测试

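The code above defines a variable named 中文. We declare it with Unicode escapes and refer to it by its literal name in console.log, and that works fine.

The \u form followed by four hexadecimal digits only works for BMP characters, so escaping a supplementary character directly like this does not work:

console.log('\u1f004'); // ἀ4

Here the engine parses \u1f004 as the character \u1f00 followed by the digit 4. We need to wrap the code point in {} (an ES2015 feature) to make it work:

console.log('\u{1f004}'); // 🀄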
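The two escape forms can be produced programmatically. The following is a sketch only; toUnicodeEscape is a hypothetical helper name, and it assumes an environment with String.prototype.padStart (ES2017).

```javascript
// Hypothetical helper: returns the escape sequence for every code point in a string.
function toUnicodeEscape(str) {
  let out = '';
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    const hex = cp.toString(16);
    // BMP code points fit the 4-digit form; supplementary ones need braces.
    out += cp <= 0xFFFF ? '\\u' + hex.padStart(4, '0') : '\\u{' + hex + '}';
  }
  return out;
}

console.log(toUnicodeEscape('中文'));      // \u4e2d\u6587
console.log(toUnicodeEscape('\u{1F004}')); // \u{1f004}
console.log(toUnicodeEscape('A'));         // \u0041
```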

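Surrogate pairs

To represent supplementary planes alongside the BMP, UTF-16 uses surrogate pairs: two 16-bit code units that together represent one code point. The rules are:

Characters in the BMP are encoded as a single 16-bit code unit, that is, two bytes.
Supplementary characters are encoded as two 16-bit code units, as follows:
- First, subtract 0x10000 from the code point.
- Then write the result as the 20-bit binary number yyyy yyyy yyxx xxxx xxxx.
- Finally, encode it as 110110yy yyyyyyyy 110111xx xxxxxxxx, 4 bytes in total.

Here 110110yyyyyyyyyy and 110111xxxxxxxxxx are the two surrogates that form a surrogate pair. The first (high) surrogate ranges from U+D800 to U+DBFF, and the second (low) surrogate ranges from U+DC00 to U+DFFF.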
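The three steps above can be followed literally for U+1F004. This is a worked example rather than a practical implementation: it manipulates binary strings on purpose so each step stays visible, and the variable names are chosen for illustration.

```javascript
const codePoint = 0x1F004;

// Step 1: subtract 0x10000.
const offset = codePoint - 0x10000; // 0xF004

// Step 2: write the result as 20 bits, then split into the high 10 (y) and low 10 (x).
const bits = offset.toString(2).padStart(20, '0');
console.log(bits.slice(0, 10), bits.slice(10)); // 0000111100 0000000100

// Step 3: prefix the halves with 110110 and 110111.
const highSurrogate = parseInt('110110' + bits.slice(0, 10), 2);
const lowSurrogate = parseInt('110111' + bits.slice(10), 2);
console.log(highSurrogate.toString(16), lowSurrogate.toString(16)); // d83c dc04

// The pair matches what the engine stores for the character.
console.log('\u{1F004}' === String.fromCharCode(highSurrogate, lowSurrogate)); // true
```

0xd83c and 0xdc04 are 55356 and 56324 in decimal, the two values charCodeAt returns for this character.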

Implementing getCodePoint

Now that we understand surrogate pairs, we can implement getCodePoint on top of charCodeAt:

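function getCodePoint(str, idx = 0) {
  const code = str.charCodeAt(idx);
  // A high surrogate followed by a low surrogate encodes one supplementary code point.
  if (code >= 0xD800 && code <= 0xDBFF) {
    const high = code;
    const low = str.charCodeAt(idx + 1);
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    }
  }
  // BMP character or unpaired surrogate: return the code unit itself.
  return code;
}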

console.log(getCodePoint('中')); // 20013
console.log(getCodePoint('🀄')); // 126980

Similarly, we can implement fromCodePoint on top of fromCharCode:

function fromCodePoint(...codePoints) {
  let str = '';
  for (let i = 0; i < codePoints.length; i++) {
    let codePoint = codePoints[i];
    if (codePoint <= 0xFFFF) {
      // BMP code point: a single code unit is enough.
      str += String.fromCharCode(codePoint);
    } else {
      // Supplementary code point: split it into a surrogate pair.
      codePoint -= 0x10000;
      const high = (codePoint >> 10) + 0xD800;
      const low = (codePoint % 0x400) + 0xDC00;
      str += String.fromCharCode(high) + String.fromCharCode(low);
    }
  }
  return str;
}

console.log(fromCodePoint(126980, 20013)); // 🀄中

So we can use the approach above to write polyfills for older browsers. In fact, the MDN documentation for codePointAt [1] and fromCodePoint [2] provides polyfills along these same lines.

getCodePointCount

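The length of a JavaScript string only counts UTF-16 code units, which is why we saw earlier:

console.log('🀄'.length); // 2: the length counts two code units

There are several ways to count Unicode characters (code points). For example, the spread operator iterates a string by code point, so: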

function getCodePointCount(str) {
  return [...str].length;
}
console.log(getCodePointCount('👫中')); // 2

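Or use a regular expression with the u flag:

function getCodePointCount(str) {
  // With the u flag each match is a whole code point.
  // [\s\S] also matches line terminators, which . would skip.
  let result = str.match(/[\s\S]/gu);
  return result ? result.length : 0;
}
console.log(getCodePointCount('👫中')); // 2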
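For environments without ES2015 iteration or the u flag, the same count can be obtained by walking the code units and counting each valid surrogate pair once. This is a sketch in ES5 style; countCodePoints is a hypothetical name, and the sample string spells out the surrogate pair of U+1F46B explicitly.

```javascript
// Hypothetical ES5-style counter: counts code points by skipping the
// second half of each valid surrogate pair.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // the pair counts once
      }
    }
    count++;
  }
  return count;
}

console.log(countCodePoints('\uD83D\uDC6B中')); // 2
console.log(countCodePoints('abc'));            // 3
```

An unpaired surrogate is counted as one item here, which is the same way string iteration treats it.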

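Going further

UTF-16 always uses 4 bytes to encode a supplementary character, whereas UTF-8 uses a variable number of bytes. The original UTF-8 design allowed 1 to 6 bytes per character; since RFC 3629 it is limited to 1 to 4 bytes, which is enough to reach U+10FFFF.

The UTF-8 encoding scheme, as originally designed, is as follows: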

Bytes  First       Last         byte1     byte2     byte3     byte4     byte5     byte6
1      U+0000      U+007F       0xxxxxxx
2      U+0080      U+07FF       110xxxxx  10xxxxxx
3      U+0800      U+FFFF       1110xxxx  10xxxxxx  10xxxxxx
4      U+10000     U+1FFFFF     11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
5      U+200000    U+3FFFFFF    111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
6      U+4000000   U+7FFFFFFF   1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
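The first four rows of the table translate directly into bit operations. The sketch below uses a hypothetical helper, encodeUtf8, that covers only the 1 to 4 byte forms needed for code points up to U+10FFFF; for brevity it does not validate its input (for example, it does not reject surrogate values).

```javascript
// Hypothetical helper: encodes one code point into its UTF-8 bytes (1 to 4 bytes).
function encodeUtf8(cp) {
  if (cp <= 0x7F) return [cp];
  if (cp <= 0x7FF) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const toHex = bytes => bytes.map(b => b.toString(16)).join(' ');
console.log(toHex(encodeUtf8(0x41)));    // 41
console.log(toHex(encodeUtf8(0xE9)));    // c3 a9
console.log(toHex(encodeUtf8(0x4E2D)));  // e4 b8 ad
console.log(toHex(encodeUtf8(0x1F004))); // f0 9f 80 84
```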

encodeURIComponent in the browser and Buffer in Node both use UTF-8 by default:

console.log(encodeURIComponent('中')); // %E4%B8%AD

const buffer = Buffer.from('中'); // new Buffer() is deprecated
console.log(buffer); // <Buffer e4 b8 ad>
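The same bytes can be obtained with TextEncoder, which is available as a global in modern browsers and in recent Node versions. This is a minimal sketch, not part of the original examples.

```javascript
// TextEncoder always produces UTF-8.
const utf8Bytes = new TextEncoder().encode('中');
console.log(Array.from(utf8Bytes, b => b.toString(16))); // ['e4', 'b8', 'ad']

// TextDecoder defaults to UTF-8 as well.
console.log(new TextDecoder().decode(utf8Bytes)); // 中

// A supplementary character takes four bytes in UTF-8.
const tileBytes = new TextEncoder().encode('\u{1F004}');
console.log(tileBytes.length); // 4
```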

Here E4, B8 and AD are the three bytes in hexadecimal. Let's try converting them back:

const byte1 = parseInt('E4', 16); // 228
const byte2 = parseInt('B8', 16); // 184
const byte3 = parseInt('AD', 16); // 173

const codePoint = (byte1 & 0xf) << 12 | (byte2 & 0x3f) << 6 | (byte3 & 0x3f);

console.log(codePoint); // 20013

We strip the marker bits 1110, 10 and 10 from the three bytes, then concatenate the remaining bits from high to low, which gives exactly 20013, the code point of '中'.

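So we can also use the UTF-8 encoding rules to write another, general version of getCodePoint. It expects a single character:

function getCodePoint(char) {
  const code = char.charCodeAt(0);
  // ASCII characters are a single byte and may not be percent-encoded at all.
  if (code <= 0x7f) return code;
  // '%E4%B8%AD' -> [0xE4, 0xB8, 0xAD]
  const bytes = encodeURIComponent(char)
    .slice(1)
    .split('%')
    .map(c => parseInt(c, 16));

  let ret = 0;
  const len = bytes.length;
  for (let i = 0; i < len; i++) {
    // The lead byte carries 7 - len payload bits; continuation bytes carry 6.
    const mask = i === 0 ? 0xff >> (len + 1) : 0x3f;
    ret |= (bytes[i] & mask) << 6 * (len - i - 1);
  }
  return ret;
}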

console.log(getCodePoint('中')); // 20013
console.log(getCodePoint('🀄')); // 126980

In the same way, we can implement fromCodePoint. Code points stop at U+10FFFF, so beyond the BMP only the four-byte form is needed:

function fromCodePoint(point) {
  if (point > 0x10ffff) throw new RangeError('Invalid code point: ' + point);
  if (point <= 0xffff) return String.fromCharCode(point);
  // Supplementary code points always take four UTF-8 bytes.
  // Build them from the lowest 6 bits upward.
  const bytes = [];
  bytes.unshift(point & 0x3f | 0x80);
  point >>>= 6;
  bytes.unshift(point & 0x3f | 0x80);
  point >>>= 6;
  bytes.unshift(point & 0x3f | 0x80);
  point >>>= 6;
  // Lead byte: 11110xxx with the remaining 3 bits.
  bytes.unshift(point & 0x7 | 0xf0);
  const code = '%' + bytes.map(b => b.toString(16)).join('%');
  return decodeURIComponent(code);
}

console.log(fromCodePoint(126980)); // 🀄

If there is anything else about Unicode you would like to discuss, feel free to leave a comment.

References

[1] codePointAt:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt

[2] fromCodePoint:

https://developer.mozilla.org/zh-CN/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint


