Oct 27, 2012

Unicode/UTF-32: every character (code point) takes exactly 4 bytes.

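UTF-16: most characters take 2 bytes, but not all of them. UTF-16 encodes every character from 0–65535 as two bytes, then uses some dirty hacks if you actually need to represent the rarely-used “astral plane” Unicode characters beyond 65535. Those hacks are surrogate pairs: two 2-byte units, 4 bytes in total.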
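
To make the surrogate-pair behaviour concrete, here is a minimal sketch, assuming Python 3. The helper name utf16_units and the sample characters are illustrative choices, not part of the original note.

```python
# Compare a BMP character with an "astral" one (code point above 65535).
bmp_char = "\u4e2d"         # U+4E2D (中), inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600 (an emoji), outside the BMP


def utf16_units(ch):
    """Return the 16-bit code units of ch in UTF-16 (big-endian, no BOM)."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_units = utf16_units(bmp_char)        # one unit: the code point itself
astral_units = utf16_units(astral_char)  # two units: a high and a low surrogate

# UTF-32 is fixed-width: 4 bytes even for the astral character.
utf32_len = len(astral_char.encode("utf-32-be"))

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
print(utf32_len)
```

The astral character comes back as two units in the D800–DFFF surrogate range, which is why UTF-16 cannot be treated as a strict 2-bytes-per-character encoding.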

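Because computers are either big-endian or little-endian, a UTF-16 file can start with a two-byte BOM (byte order mark, U+FEFF). It is written as the bytes FE FF in big-endian order and FF FE in little-endian order; the two sequences are exact reversals of each other, so a reader can tell the byte order.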
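
A small sketch of how the BOM resolves the byte order, assuming Python 3 and its standard codecs module. The variable names are illustrative.

```python
import codecs

text = "A"

# The same text serialized in both byte orders, each prefixed with its BOM.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")     # FE FF 00 41
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # FF FE 41 00

# The generic "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
decoded_be = big_endian.decode("utf-16")
decoded_le = little_endian.decode("utf-16")

print(big_endian.hex(" "), "->", decoded_be)
print(little_endian.hex(" "), "->", decoded_le)
```

Both byte streams decode to the same text even though their bytes differ, because the first two bytes tell the decoder which order to use.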

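UTF-8: a variable-length encoding. ASCII characters (0–127) take one byte each; other characters take two, three, or four bytes.
A UTF-8 file may start with a signature (the BOM encoded as the bytes EF BB BF), but it is optional and many files omit it.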
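Advantages

Efficient: compact for text that is mostly ASCII.
No byte-order problem: UTF-8 is defined as a sequence of single bytes, so the byte stream is identical on every machine. (Which feels rather magical.)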
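
The points above can be checked with a short sketch, assuming Python 3. The sample characters and variable names are illustrative.

```python
import codecs

# Variable length: 1 to 4 bytes depending on the code point.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, é, 中, and an emoji
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The optional signature: Python's "utf-8-sig" codec writes EF BB BF first.
signed = "A".encode("utf-8-sig")
has_signature = signed.startswith(codecs.BOM_UTF8)

# Byte order: UTF-16 has two serializations of the same character,
# while UTF-8 has only one.
zhong = "\u4e2d"
utf16_be = zhong.encode("utf-16-be")
utf16_le = zhong.encode("utf-16-le")
utf8 = zhong.encode("utf-8")

print(utf8_lengths)
print(signed.hex(" "), has_signature)
print(utf16_be.hex(" "), utf16_le.hex(" "), utf8.hex(" "))
```

Python exposes no big-endian or little-endian variant of UTF-8 because none exists, which is why the byte stream is the same everywhere.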

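Disadvantages

Finding the n-th character takes O(n) time rather than O(1), because characters have different byte lengths and the bytes must be scanned from the start.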
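
A rough sketch of why the lookup is linear, assuming Python 3 and well-formed UTF-8 input. The functions utf8_char_at and utf32_char_at are hypothetical helpers written for this illustration, not library calls.

```python
def utf8_char_at(data, index):
    """Return the character at position index by scanning UTF-8 bytes from the start."""
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return data[start:].decode("utf-8")


def utf32_char_at(data, index):
    """Fixed width: the character's offset is simply 4 * index."""
    return data[4 * index:4 * index + 4].decode("utf-32-be")


text = "a\u4e2db"  # a, 中, b
second_from_utf8 = utf8_char_at(text.encode("utf-8"), 1)
second_from_utf32 = utf32_char_at(text.encode("utf-32-be"), 1)
print(second_from_utf8, second_from_utf32)
```

The UTF-8 version has to look at every byte before the target, while the UTF-32 version jumps straight to the offset.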

GBK: GBK@Wikipedia

A character is encoded as 1 or 2 bytes. A byte in the range 00–7F is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 96 characters and 32 control codes in this range.
A byte with the high bit set indicates that it is the first of 2 bytes. Loosely speaking, the first byte is in the range 81–FE (that is, never 80 or FF), and the second byte is 40–7E for some areas and 80–FE for others.

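So any byte greater than 127 starts a two-byte code; that checks out.
GBK covers not only Chinese but also Greek and Japanese (kana). It does not include Korean (Hangul).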
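
The one-or-two-byte rule and the character coverage can be tried out with a sketch like the following, assuming Python 3 and its built-in "gbk" codec. The helpers split_gbk and gbk_can_encode are hypothetical names introduced here, and split_gbk assumes well-formed input.

```python
def split_gbk(data):
    """Split GBK bytes into per-character chunks using the high-bit rule."""
    chunks = []
    i = 0
    while i < len(data):
        # High bit set: this is the first byte of a two-byte code.
        width = 2 if data[i] > 0x7F else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


def gbk_can_encode(ch):
    """Report whether Python's gbk codec has a mapping for ch."""
    try:
        ch.encode("gbk")
        return True
    except UnicodeEncodeError:
        return False


chunks = split_gbk("A\u4e2db".encode("gbk"))  # A, 中, b
coverage = {
    "chinese": gbk_can_encode("\u4e2d"),  # 中
    "greek": gbk_can_encode("\u03b1"),    # α
    "kana": gbk_can_encode("\u3042"),     # あ
    "hangul": gbk_can_encode("\ud55c"),   # 한
}
print(chunks)
print(coverage)
```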

How does the system handle this?

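UTF-8 and UTF-16 files can carry a BOM at the start, so when it is present the system can tell which encoding the file uses.
GBK has no such marker, so the system default encoding should be set to GBK for these files to be read correctly (on Windows).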
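
As an illustration of this detection logic, here is a minimal sketch assuming Python 3. The function sniff_encoding and its default of "gbk" are assumptions made for this example; real editors and operating systems use more elaborate heuristics, and this sketch ignores UTF-32.

```python
import codecs


def sniff_encoding(data, default="gbk"):
    """Guess a codec name from a leading BOM, falling back to a default."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the signature when decoding
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"     # this codec reads the BOM to pick the byte order
    return default          # no marker: rely on the assumed system default


text = "\u4e2d\u6587"  # 中文
guesses = [
    sniff_encoding(text.encode("utf-8-sig")),
    sniff_encoding(text.encode("utf-16")),
    sniff_encoding(text.encode("gbk")),
]
round_trips = [
    data.decode(sniff_encoding(data)) == text
    for data in (text.encode("utf-8-sig"), text.encode("utf-16"), text.encode("gbk"))
]
print(guesses, round_trips)
```

The fallback branch is the weak point: without a BOM, a GBK file is only read correctly if the default happens to be GBK.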

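UltraEdit (on English-language Windows, with the encoding used for non-Unicode text set to Chinese) when saving a file:
The default encoding is UTF-8.
The ANSI encoding can also store Chinese; with this setting it is equivalent to GBK.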
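
On a Chinese-locale Windows system, "ANSI" refers to code page 936. A small sketch, assuming Python 3, where the codec name "cp936" is an alias for "gbk"; the variable names are illustrative.

```python
import codecs

text = "\u4e2d\u6587"  # 中文

ansi_bytes = text.encode("cp936")  # what an "ANSI" save would produce under code page 936
gbk_bytes = text.encode("gbk")

# In Python both names resolve to the same codec.
same_codec = codecs.lookup("cp936").name == codecs.lookup("gbk").name

print(ansi_bytes.hex(" "), gbk_bytes.hex(" "), same_codec)
```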

