Oct 27, 2012

Unicode, UTF-16, UTF-8, GBK, and how the system handles them


Unicode/UTF-32 (the Unicode code point stored directly): every character is 4 bytes.

UTF-16: every character from 0 to 65535 (U+0000 to U+FFFF) is encoded as two bytes. The rarely used "astral plane" characters beyond 65535 need a hack: each one is encoded as a surrogate pair, two 16-bit units, for 4 bytes in total. So UTF-16 is not strictly 2 bytes per character.
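To make the surrogate-pair rule concrete, here is a minimal Python sketch. The helper names `utf16_code_units` and `surrogate_pair` are made up for illustration; only `str.encode` is a standard call.

```python
def utf16_code_units(ch):
    """Number of 16-bit code units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-be")) // 2


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


print(utf16_code_units("中"))               # 1 unit (2 bytes), U+4E2D is below U+FFFF
print(utf16_code_units("\U0001F600"))       # 2 units (4 bytes), U+1F600 is above U+FFFF
print([hex(u) for u in surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

The pair computed by hand matches what the codec writes for the same character.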

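Because computers can be big-endian or little-endian, a UTF-16 file can start with a two-byte BOM (byte order mark, U+FEFF). It is written as the bytes FE FF in big-endian order and FF FE in little-endian order. The two are exact mirror images, so a reader can tell the byte order from them.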
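A small Python sketch of reading the byte order from the BOM. The function name `utf16_byte_order` is hypothetical; the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants come from the standard library.

```python
import codecs


def utf16_byte_order(data):
    """Report the byte order announced by a UTF-16 BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return "unknown"


be = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")   # FE FF 00 41
le = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")   # FF FE 41 00

print(utf16_byte_order(be), utf16_byte_order(le))
# The generic "utf-16" codec reads the BOM itself and strips it.
print(be.decode("utf-16"), le.decode("utf-16"))
```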

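UTF-8: a variable-length encoding. ASCII characters (0-127) take one byte; other characters take two, three, or four bytes.
A UTF-8 file may start with a BOM (the bytes EF BB BF), but it is optional and many files do not have one.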
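Advantages
  • Efficient: compact for text that is mostly ASCII
  • No byte-order problem: the byte stream is identical on every computer. (Pretty neat.)
Disadvantages
  • Finding the n-th character is O(n), not O(1)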
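Both points can be illustrated in Python. This is only a sketch: `nth_char_offset` is a hypothetical helper that assumes well-formed UTF-8 input and finds character starts by skipping continuation bytes (those of the form 10xxxxxx).

```python
samples = ["A", "\u00e9", "中", "\U0001F600"]
lengths = [len(s.encode("utf-8")) for s in samples]
print(lengths)  # [1, 2, 3, 4]


def nth_char_offset(data, n):
    """Byte offset of the n-th (0-based) character, found by linear scan."""
    count = -1
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:      # not a continuation byte: a character starts here
            count += 1
            if count == n:
                return i
    raise IndexError("fewer than n + 1 characters")


data = "A中\u00e9".encode("utf-8")   # 41 | E4 B8 AD | C3 A9
print(nth_char_offset(data, 2))      # 4
```

The scan has to walk the bytes from the start because character widths vary, which is where the O(n) cost comes from.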

GBK: a character is encoded as 1 or 2 bytes. A byte in the range 00–7F is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 95 printable characters and 33 control codes in this range.
A byte with the high bit set indicates that it is the first of 2 bytes. Loosely speaking, the first byte is in the range 81–FE (that is, never 80 or FF), and the second byte is 40–7E for some areas and 80–FE for others.
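So a byte value above 127 marks the start of a two-byte code; that understanding is correct.
GBK covers not only Chinese but also Greek and Japanese characters. It does not cover Korean.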
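The "above 127 means two bytes" rule can be sketched in Python as follows. `gbk_char_lengths` is a hypothetical helper that trusts its input; it does not validate the lead-byte or trail-byte ranges described above.

```python
def gbk_char_lengths(data):
    """Split GBK bytes into per-character lengths using only the high bit."""
    lengths = []
    i = 0
    while i < len(data):
        step = 1 if data[i] <= 0x7F else 2   # high bit set: first of 2 bytes
        lengths.append(step)
        i += step
    return lengths


raw = "中a文".encode("gbk")          # D6 D0 | 61 | CE C4
print(gbk_char_lengths(raw))          # [2, 1, 2]

# Greek and Japanese kana encode fine; Korean hangul does not.
print(len("α".encode("gbk")), len("あ".encode("gbk")))
try:
    "한".encode("gbk")
except UnicodeEncodeError:
    print("hangul is not representable in GBK")
```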

How does the system handle them?

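UTF-8 and UTF-16 files can carry a BOM, so when one is present the system can tell which encoding the file uses.
GBK has no BOM, so GBK should be set as the system default encoding (on Windows); files without a BOM are then read as GBK.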
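The detection logic described here might look like the following Python sketch. It is an assumption about the general approach, not how Windows or any particular editor implements it; `guess_encoding` and its `default` parameter are hypothetical names. It also ignores UTF-32, whose little-endian BOM begins with the same FF FE bytes as UTF-16.

```python
import codecs


def guess_encoding(data, default="gbk"):
    """Pick a codec from the BOM, falling back to a system default."""
    if data.startswith(codecs.BOM_UTF8):                      # EF BB BF
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"                                       # codec reads the BOM
    return default                                            # no BOM: assume default


for payload in ("中".encode("utf-8-sig"), "中".encode("utf-16"), "中".encode("gbk")):
    encoding = guess_encoding(payload)
    print(encoding, payload.decode(encoding))
```

A UTF-8 file without a BOM would fall through to the default here and be misread as GBK, which is the kind of garbling that shows up when the default encoding does not match the file.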

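Saving files in UltraEdit (English Windows, with the language for non-Unicode programs set to Chinese):
Default encoding: UTF-8
ANSI encoding: can also save Chinese; with this setup it is equivalent to GBK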
