• UTF-8 and Unicode FAQ for Unix/Linux

    2008-02-21

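    I ran into an encoding problem again today and meant to write an article about it, but I do not think I have gathered enough material yet, so it will wait until the time is right. [Unfinished]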

    [Reference documents]

    http://blog.csdn.net/lovekatherine/archive/2007/08/30/1765903.aspx 

    http://www.donews.net/holen/archive/2004/11/30/188182.aspx 

    http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html

    http://www.fileformat.info/

    http://baike.baidu.com/view/628163.htm

    http://hi.baidu.com/uniqcmt/blog/item/e1f44fdd88b086eb77c63803.html

    http://tieba.baidu.com/f?kz=94620397

    http://www.cnblogs.com/sirsunny/archive/2008/02/15/35183.html

     

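    Third-rate countries make products, second-rate countries supply technology, and first-rate countries set the standards.

    Original text: http://www.cl.cam.ac.uk/~mgk25/unicode.html#

         by Markus Kuhn

    Translation: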

         UTF-8 and Unicode FAQ for Unix/Linux

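    This text is a comprehensive one-stop information resource on how to use Unicode/UTF-8 on POSIX systems (Linux, Unix). You will find here both introductory information for every user and detailed references for the experienced developer.

    Unicode has started to replace ASCII, ISO 8859 and EUC (Extended Unix Code) at all levels. It not only enables users to handle practically any script and language used on this planet, it also supports a comprehensive set of mathematical and technical symbols to simplify scientific information exchange.

    With the UTF-8 encoding, Unicode can be used in a convenient and backwards-compatible way in environments that were designed entirely around ASCII, such as Unix. UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems. Now is the time to make sure that you are familiar with it and that your software supports Unicode smoothly.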


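    What are UCS and ISO 10646?

    The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility with other character sets. This simply means that no information is lost if you convert any text string to UCS and then back to its original encoding.

    UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and others. For scripts not yet covered, research on how best to encode them for computer use is still going on, and they will be added eventually. This includes not only historic scripts such as Cuneiform, Hieroglyphs and various Indo-European notations, but even some selected artistic scripts such as Tolkien’s Tengwar and Cirth. UCS also covers a large number of graphical, typographical, mathematical and scientific symbols, including those provided by TeX, PostScript, APL, the International Phonetic Alphabet (IPA), MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems. The standard continues to be maintained and updated, and ever more exotic and specialized symbols and characters will be added over the coming years.

    ISO 10646 originally defined a 31-bit character set. The subsets of 2^16 characters whose elements differ (in a 32-bit integer representation) only in the 16 least-significant bits are called the planes of UCS.

    The most commonly used characters, including all those found in the major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD), which is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation. Current plans are that no characters will ever be assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, ISO 10646-2, was added in 2001 and defines characters encoded outside the BMP. In the 2003 edition, the two parts were combined into a single ISO 10646 standard. New characters are still being added on a continuous basis, but the existing characters will not be changed any more and are stable.

    UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by “U+”, as in “U+0041” for the character “Latin capital letter A”. The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV), and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use. UCS also defines several methods for encoding a string of characters as a sequence of bytes, such as UTF-8 and UTF-16.

    The full reference for the UCS standard is: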

    International Standard ISO/IEC 10646, Information technology — Universal Multiple-Octet Coded Character Set (UCS) . Third edition, International Organization for Standardization, Geneva, 2003.

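    The standard can be ordered online from ISO as a set of PDF files on CD-ROM for 112 CHF.

    [NEW] In September 2006, ISO published a free online version of ISO 10646:2003 on its Freely Available Standards web page. The ZIP file is 82 MB in size.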

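    What are combining characters?

    Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with older character sets. They are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings that have no combining characters, such as ISO 8859. The combining-character mechanism allows accents and other diacritical marks to be added to any character. This is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical marks could be needed.

    Combining characters follow the character that they modify. For example, the German umlaut character Ä (“Latin capital letter A with diaeresis”) can be represented either by the precomposed UCS code U+00C4 or by a normal “Latin capital letter A” followed by a “combining diaeresis”: U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or to add combining marks both above and below the base character. The Thai script, for example, needs up to two combining characters on a single base character.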

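    What are UCS implementation levels?

    Not all systems can be expected to support all the advanced mechanisms of UCS, such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:

    Level 1

        Combining characters and Hangul Jamo characters are not supported.

        [Hangul Jamo are an alternative representation of precomposed modern Hangul syllables as a sequence of consonants and vowels. They are required to fully support the Korean script, including Middle Korean.]

    Level 2

        Like level 1, except that for some scripts a fixed list of combining characters is allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters.

    Level 3

        All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.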

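    Has UCS been adopted as a national standard?

    Yes, a number of countries have published national adoptions of ISO 10646, sometimes after adding annexes with cross-references to older national standards, implementation guidelines, and specifications of various national implementation subsets:

    • China: GB 13000.1-93
    • Japan: JIS X 0221-1:2001
    • Korea: KS X 1005-1:1995 (includes ISO 10646-1:1993 amendments 1-7)
    • Vietnam: TCVN 6909:2001
      (This “16-bit Coded Vietnamese Character Set” is a small UCS subset, to be implemented for data interchange with and within government agencies as of 2002-07-01.)
    • Iran: ISIRI 6219:2002, Information Technology – Persian Information Interchange and Display Mechanism, using Unicode. (This is not a version or subset of ISO 10646, but a separate document that provides additional national guidance and clarification on handling the Persian language and the Arabic script in Unicode.)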

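    What is Unicode?

    In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode Project, organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized around 1991 that the world does not need two different unified character sets. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of Unicode and ISO 10646 compatible, and they closely coordinate any further extensions. Unicode 1.1 corresponds to ISO 10646-1:1993, Unicode 3.0 corresponds to ISO 10646-1:2000, Unicode 3.2 added ISO 10646-2:2001, Unicode 4.0 corresponds to ISO 10646:2003, and Unicode 5.0 corresponds to ISO 10646:2003 plus its amendments 1–3. All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future.

    The Unicode Standard can be ordered like any ordinary book, for instance via amazon.com for around 60 USD: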

    The Unicode Consortium: The Unicode Standard 5.0,
    Addison-Wesley, 2006,
    ISBN 0-321-48091-0.

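    If you work frequently with text processing and character sets, you should definitely get a copy of the Unicode Standard. Unicode 5.0 is also available online.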

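    What is the difference between Unicode and ISO 10646?

    The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 implementation level 3. All characters are at the same positions and have the same names in both standards.

    The Unicode Standard additionally defines much more semantics for some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (such as Arabic), for handling bi-directional text that mixes, for instance, Latin and Hebrew, for sorting and string comparison, and much more.

    The ISO 10646 standard, on the other hand, is not much more than a simple character set table, comparable to the old ISO 8859 standards. It specifies some terminology related to the standard, defines some encoding alternatives, and contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022. There are other closely related ISO standards, for instance ISO 14651 on sorting UCS strings. A nice feature of the ISO 10646-1 standard is that it provides CJK example glyphs in five different style variants, while the Unicode Standard shows the CJK ideographs only in a Chinese variant.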

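    What is UTF-8?

    UCS and Unicode are first of all just code tables that assign integer numbers to characters. There are several alternatives for representing a sequence of such characters, or their respective integer values, as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2-byte or 4-byte units. The official names of these encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first (big-endian convention). An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. For a UCS-4 file, three 0x00 bytes have to be inserted before every ASCII byte instead.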
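    The zero-padding rule described above can be sketched in C as follows. This is only an illustration: the function names latin1_to_ucs2be and latin1_to_ucs4be are made up for this example and do not belong to any library, and the caller is assumed to supply an output buffer that is large enough.

```c
#include <stddef.h>

/* Convert n Latin-1 bytes to big-endian UCS-2 by putting a 0x00 byte
 * in front of each input byte. The caller must provide at least 2 * n
 * bytes in out. Returns the number of bytes written.
 * (Hypothetical helper, for illustration only.) */
size_t latin1_to_ucs2be(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = 0x00;   /* high byte is always zero for Latin-1 */
        out[2 * i + 1] = in[i];  /* low byte is the Latin-1 code itself */
    }
    return 2 * n;
}

/* Same idea for big-endian UCS-4: three 0x00 bytes before each input
 * byte. The caller must provide at least 4 * n bytes in out. */
size_t latin1_to_ucs4be(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[4 * i]     = 0x00;
        out[4 * i + 1] = 0x00;
        out[4 * i + 2] = 0x00;
        out[4 * i + 3] = in[i];
    }
    return 4 * n;
}
```

    Note that the output of either function is full of 0x00 bytes, so it cannot be handled as an ordinary zero-terminated C string.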

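    Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings in these encodings can contain, as parts of many wide characters, bytes such as “\0” or “/” that have a special meaning in filenames and other C library function parameters. In addition, the majority of UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc.

    The UTF-8 encoding, defined in ISO 10646-1:2000 Annex D and also described in RFC 3629 as well as section 3.9 of the Unicode 4.0 standard, does not have these problems. It is clearly the way to go for using Unicode under Unix-style operating systems.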

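    UTF-8 has the following properties:

    • UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings that contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
    • All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
    • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes the character consists of. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
    • All possible 2^31 UCS codes can be encoded.
    • UTF-8 encoded characters may theoretically be up to six bytes long, but 16-bit BMP characters are at most three bytes long.
    • The sorting order of big-endian UCS-4 byte strings is preserved.
    • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.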
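    The lead-byte property in the list above can be expressed as a small classification function. The sketch below follows the original six-byte definition given in this text; the name utf8_sequence_length is an assumption made for this example.

```c
/* Return the total length in bytes of the UTF-8 sequence that starts
 * with the given lead byte (1 to 6, following the original ISO 10646
 * definition), or 0 if the byte cannot start a sequence: continuation
 * bytes 0x80-0xBF and the unused bytes 0xFE and 0xFF.
 * (Hypothetical helper, for illustration only.) */
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (lead < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (lead < 0xE0) return 2;   /* 110xxxxx */
    if (lead < 0xF0) return 3;   /* 1110xxxx */
    if (lead < 0xF8) return 4;   /* 11110xxx */
    if (lead < 0xFC) return 5;   /* 111110xx */
    if (lead < 0xFE) return 6;   /* 1111110x */
    return 0;                    /* 0xFE and 0xFF never occur */
}
```

    Because every byte announces its own role, a reader that starts in the middle of a stream can skip bytes until this function returns a non-zero value and is then synchronized again.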

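    The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character: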

    U-00000000 – U-0000007F:

    0xxxxxxx

    U-00000080 – U-000007FF:

    110xxxxx 10xxxxxx

    U-00000800 – U-0000FFFF:

    1110xxxx 10xxxxxx 10xxxxxx

    U-00010000 – U-001FFFFF:

    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    U-00200000 – U-03FFFFFF:

    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    U-04000000 – U-7FFFFFFF:

    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

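    The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence that can represent the code number of the character may be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.

    Examples: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

        11000010 10101001 = 0xC2 0xA9

    and the character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as: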

        11100010 10001001 10100000 = 0xE2 0x89 0xA0
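    The table and the two examples above can be reproduced with a short encoder. This is a minimal sketch of the original six-byte scheme, not a reference implementation; the name utf8_encode is an assumption made for this example, and the caller is assumed to supply a buffer of at least six bytes.

```c
#include <stddef.h>

/* Encode the UCS code number c (0 .. 0x7FFFFFFF) as UTF-8 into buf,
 * which must have room for 6 bytes. Returns the number of bytes
 * written, or 0 if c is out of range. The shortest form is always
 * chosen. (Hypothetical helper, for illustration only.) */
size_t utf8_encode(unsigned long c, unsigned char *buf)
{
    size_t len, i;
    unsigned char lead;

    if (c < 0x80UL) { buf[0] = (unsigned char)c; return 1; }
    else if (c < 0x800UL)       { len = 2; lead = 0xC0; }
    else if (c < 0x10000UL)     { len = 3; lead = 0xE0; }
    else if (c < 0x200000UL)    { len = 4; lead = 0xF0; }
    else if (c < 0x4000000UL)   { len = 5; lead = 0xF8; }
    else if (c <= 0x7FFFFFFFUL) { len = 6; lead = 0xFC; }
    else return 0;

    /* Fill the continuation bytes from the right, 6 bits at a time. */
    for (i = len - 1; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    buf[0] = (unsigned char)(lead | c);  /* remaining high bits */
    return len;
}
```

    Calling utf8_encode(0x00A9, buf) stores 0xC2 0xA9 and returns 2, and utf8_encode(0x2260, buf) stores 0xE2 0x89 0xA0 and returns 3, matching the examples above.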

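    The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Please do not write UTF-8 in any documentation text in other ways (such as utf8 or UTF_8), unless you are referring to a variable name and not to the encoding itself.

    An important note for developers of UTF-8 decoding routines: for security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. For example, the character U+000A (line feed) must be accepted from a UTF-8 stream only in the form 0x0A, and not in any of the following five possible overlong forms:

      0xC0 0x8A
      0xE0 0x80 0x8A
      0xF0 0x80 0x80 0x8A
      0xF8 0x80 0x80 0x80 0x8A
      0xFC 0x80 0x80 0x80 0x80 0x8A

    Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests that look only for the shortest possible encoding. All overlong UTF-8 sequences start with one of the following byte patterns: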

    1100000x (10xxxxxx)

    11100000 100xxxxx (10xxxxxx)

    11110000 1000xxxx (10xxxxxx 10xxxxxx)

    11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)

    11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)

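    Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data. For security reasons, UTF-8 decoders should treat them like malformed or overlong sequences.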
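    The checks described above can be combined in a strict decoder along the following lines. This is a sketch under stated assumptions: the name utf8_decode_strict is made up for this example, and the function deliberately accepts only sequences of up to four bytes (U+0000 to U+10FFFF), rejecting the 5-byte and 6-byte forms outright.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *out.
 * Returns the number of bytes consumed, or 0 if the input is
 * truncated, malformed, overlong, a UTF-16 surrogate, or one of
 * U+FFFE / U+FFFF. (Hypothetical helper, for illustration only.) */
size_t utf8_decode_strict(const unsigned char *s, size_t n, unsigned long *out)
{
    /* Smallest code number that legitimately needs a given length. */
    static const unsigned long min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long c;
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80)      { len = 1; c = s[0]; }
    else if (s[0] < 0xC0) return 0;               /* stray continuation byte */
    else if (s[0] < 0xE0) { len = 2; c = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { len = 3; c = s[0] & 0x0F; }
    else if (s[0] < 0xF8) { len = 4; c = s[0] & 0x07; }
    else return 0;                                /* 5/6-byte forms, 0xFE, 0xFF */

    if (n < len) return 0;                        /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;      /* not of the form 10xxxxxx */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_value[len]) return 0;             /* overlong encoding */
    if (c > 0x10FFFFUL) return 0;                 /* beyond the 21-bit code space */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;     /* UTF-16 surrogates */
    if (c == 0xFFFE || c == 0xFFFF) return 0;     /* non-characters */
    *out = c;
    return len;
}
```

    The overlong test works by comparing the decoded value with the smallest value that requires the observed sequence length, which covers all of the byte patterns listed above without enumerating them.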

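    Markus Kuhn’s UTF-8 decoder stress test file contains a systematic collection of malformed and overlong UTF-8 sequences and will help you to verify the robustness of your decoder.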

    Who invented UTF-8?

    The encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike’s UTF-8 history). It replaced an earlier attempt to design a FSS/UTF (file system safe UCS transformation format) that was circulated in an X/Open working document in August 1992 by Gary Miller (IBM), Greger Leijonhufvud and John Entenmann (SMI) as a replacement for the division-heavy UTF-1 encoding from the first edition of ISO 10646-1. By the end of the first week of September 1992, Pike and Thompson had turned AT&T Bell Lab’s Plan 9 into the world’s first operating system to use UTF-8. They reported about their experience at the USENIX Winter 1993 Technical Conference, San Diego, January 25-29, 1993, Proceedings, pp. 43-50. FSS/UTF was briefly also referred to as UTF-2 and later renamed into UTF-8, and pushed through the standards process by the X/Open Joint Internationalization Group XOJIG.

    Where do I find nice UTF-8 example files?

    A few interesting UTF-8 example files for tests and demonstrations are:

    What different encodings are there?

    Both the UCS and Unicode standards are first of all large tables that assign to every character an integer number. If you use the term “UCS”, “ISO 10646”, or “Unicode”, this just refers to a mapping between characters and integers. This does not yet specify how to store these integers as a sequence of bytes in memory.

    ISO 10646-1 defines the UCS-2 and UCS-4 encodings. These are sequences of 2 bytes and 4 bytes per character, respectively. ISO 10646 was from the beginning designed as a 31-bit character set (with possible code positions ranging from U-00000000 to U-7FFFFFFF), however it took until 2001 for the first characters to be assigned beyond the Basic Multilingual Plane (BMP), that is beyond the first 2^16 character positions (see ISO 10646-2 and Unicode 3.1). UCS-4 can represent all UCS and Unicode characters, UCS-2 can represent only those from the BMP (U+0000 to U+FFFF).

    “Unicode” originally implied that the encoding was UCS-2 and it initially didn’t make any provisions for characters outside the BMP (U+0000 to U+FFFF). When it became clear that more than 64k characters would be needed for certain special applications (historic alphabets and ideographs, mathematical and musical typesetting, etc.), Unicode was turned into a sort of 21-bit character set with possible code points in the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024 non-BMP characters to be represented as a sequence of two 16-bit surrogate characters. This way UTF-16 was born, which represents the extended “21-bit” Unicode in a way backwards compatible with UCS-2. The term UTF-32 was introduced in Unicode to describe a 4-byte encoding of the extended “21-bit” Unicode. UTF-32 is the exact same thing as UCS-4, except that by definition UTF-32 is never used to represent characters above U-0010FFFF, while UCS-4 can cover all 2^31 code positions up to U-7FFFFFFF. The ISO 10646 working group has agreed to modify their standard to exclude code positions beyond U-0010FFFF, in order to turn the new UCS-4 and UTF-32 into practically the same thing.
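    The surrogate mechanism described above amounts to splitting the 20 bits that remain after subtracting 0x10000 into two 10-bit halves. A minimal sketch, with utf16_encode as a made-up name for this example:

```c
/* Encode code point c as UTF-16 code units in out[].
 * Returns 1 for a BMP character, 2 for a surrogate pair, and 0 if c
 * is itself a surrogate or lies beyond U+10FFFF.
 * (Hypothetical helper, for illustration only.) */
int utf16_encode(unsigned long c, unsigned short out[2])
{
    if (c >= 0xD800 && c <= 0xDFFF) return 0;   /* surrogates are not characters */
    if (c < 0x10000UL) {                        /* BMP: one 16-bit unit */
        out[0] = (unsigned short)c;
        return 1;
    }
    if (c > 0x10FFFFUL) return 0;
    c -= 0x10000UL;                                   /* 20 bits remain */
    out[0] = (unsigned short)(0xD800 | (c >> 10));    /* high surrogate: top 10 bits */
    out[1] = (unsigned short)(0xDC00 | (c & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

    With 1024 possible values in each half, the two surrogate ranges together address exactly the 1024×1024 code points from U+10000 to U+10FFFF.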

    In addition to all that, UTF-8 was introduced to provide an ASCII backwards compatible multi-byte encoding. The definitions of UTF-8 in UCS and Unicode differed originally slightly, because in UCS, up to 6-byte long UTF-8 sequences were possible to represent characters up to U-7FFFFFFF, while in Unicode only up to 4-byte long UTF-8 sequences are defined to represent characters up to U-0010FFFF. (The difference was in essence the same as between UCS-4 and UTF-32.)

    No endianness is implied by the encoding names UCS-2, UCS-4, UTF-16, and UTF-32, though ISO 10646-1 says that Bigendian should be preferred unless otherwise agreed. It has become customary to append the letters “BE” (Bigendian, high-byte first) and “LE” (Littleendian, low-byte first) to the encoding names in order to explicitly specify a byte order.

    In order to allow the automatic detection of the byte order, it has become customary on some platforms (notably Win32) to start every Unicode file with the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), also known as the Byte-Order Mark (BOM). Its byte-swapped equivalent U+FFFE is not a valid Unicode character, therefore it helps to unambiguously distinguish the Bigendian and Littleendian variants of UTF-16 and UTF-32.
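    As a rough illustration of the detection described above, the following sketch inspects the first two bytes of a UTF-16 (or UCS-2) stream. The enum and the function name detect_utf16_order are assumptions made for this example. The sketch does not try to tell UTF-16 from UTF-32, whose little-endian BOM begins with the same two bytes.

```c
#include <stddef.h>

enum byte_order { ORDER_UNKNOWN, ORDER_BIG, ORDER_LITTLE };

/* Look for U+FEFF at the start of a 16-bit encoded stream.
 * (Hypothetical helper, for illustration only.) */
enum byte_order detect_utf16_order(const unsigned char *s, size_t n)
{
    if (n >= 2) {
        if (s[0] == 0xFE && s[1] == 0xFF) return ORDER_BIG;     /* U+FEFF, high byte first */
        if (s[0] == 0xFF && s[1] == 0xFE) return ORDER_LITTLE;  /* same character, bytes swapped */
    }
    return ORDER_UNKNOWN;   /* no BOM: fall back to a default or an explicit setting */
}
```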

    A full featured character encoding converter will have to provide the following 13 encoding variants of Unicode and UCS:

    UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE

    Where no byte order is explicitly specified, use the byte order of the CPU on which the conversion takes place and in an input stream swap the byte order whenever U+FFFE is encountered. The difference between outputting UCS-4 versus UTF-32 and UTF-16 versus UCS-2 lies in handling out-of-range characters. The fallback mechanism for non-representable characters has to be activated in UTF-32 (for characters > U-0010FFFF) or UCS-2 (for characters > U+FFFF) even where UCS-4 or UTF-16 respectively would offer a representation.

    Really just of historic interest are UTF-1, UTF-7, SCSU and a dozen other less widely publicised UCS encoding proposals with various properties, none of which ever enjoyed any significant use. Their use should be avoided.

    A good encoding converter will also offer options for adding or removing the BOM:

    • Unconditionally prefix the output text with U+FEFF.
    • Prefix the output text with U+FEFF unless it is already there.
    • Remove the first character if it is U+FEFF.

    It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) as a signature to mark the beginning of a UTF-8 file. This practice should definitely not be used on POSIX systems for several reasons:

    • On POSIX systems, the locale and not magic file type codes define the encoding of plain text files. Mixing the two concepts would add a lot of complexity and break existing functionality.
    • Adding a UTF-8 signature at the start of a file would interfere with many established conventions such as the kernel looking for “#!” at the beginning of a plaintext executable to locate the appropriate interpreter.
    • Handling BOMs properly would add undesirable complexity even to simple programs like cat or grep that mix contents of several files into one.

    In addition to the encoding alternatives, Unicode also specifies various Normalization Forms, which provide reasonable subsets of Unicode, especially to remove encoding ambiguities caused by the presence of precomposed and compatibility characters:

    • Normalization Form D (NFD): Split up (decompose) precomposed characters into combining sequences where possible, e.g. use U+0041 U+0308 (LATIN CAPITAL LETTER A, COMBINING DIAERESIS) instead of U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS). Also avoid deprecated characters, e.g. use U+0041 U+030A (LATIN CAPITAL LETTER A, COMBINING RING ABOVE) instead of U+212B (ANGSTROM SIGN).
    • Normalization Form C (NFC): Use precomposed characters instead of combining sequences where possible, e.g. use U+00C4 (“Latin capital letter A with diaeresis”) instead of U+0041 U+0308 (“Latin capital letter A”, “combining diaeresis”). Also avoid deprecated characters, e.g. use U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) instead of U+212B (ANGSTROM SIGN).
      NFC is the preferred form for Linux and WWW.
    • Normalization Form KD (NFKD): Like NFD, but avoid in addition the use of compatibility characters, e.g. use “fi” instead of U+FB01 (LATIN SMALL LIGATURE FI).
    • Normalization Form KC (NFKC): Like NFC, but avoid in addition the use of compatibility characters, e.g. use “fi” instead of U+FB01 (LATIN SMALL LIGATURE FI).

    A full-featured character encoding converter should also offer conversion between normalization forms. Care should be used with mapping to NFKD or NFKC, as semantic information might be lost (for instance U+00B2 (SUPERSCRIPT TWO) maps to 2) and extra mark-up information might have to be added to preserve it (e.g., <SUP>2</SUP> in HTML).

    What programming languages support Unicode?

    More recent programming languages that were developed after around 1993 already have special data types for Unicode/ISO 10646-1 characters. This is the case with Ada95, Java, TCL, Perl, Python, C# and others.

    ISO C 90 specifies mechanisms to handle multi-byte encoding and wide characters. These facilities were improved with Amendment 1 to ISO C 90 in 1994 and even further improvements were made in the ISO C 99 standard. These facilities were designed originally with various East-Asian encodings in mind. They are on one side slightly more sophisticated than what would be necessary to handle UCS (handling of “shift sequences”), but also lack support for more advanced aspects of UCS (combining characters, etc.). UTF-8 is an example of what the ISO C standard calls multi-byte encoding. The type wchar_t, which in modern environments is usually a signed 32-bit integer, can be used to hold Unicode characters.

    Unfortunately, wchar_t was already widely used for various Asian 16-bit encodings throughout the 1990s. Therefore, the ISO C 99 standard was bound by backwards compatibility. It could not be changed to require wchar_t to be used with UCS, like Java and Ada95 managed to do. However, the C compiler can at least signal to an application that wchar_t is guaranteed to hold UCS values in all locales. To do so, it defines the macro __STDC_ISO_10646__ to be an integer constant of the form yyyymmL. The year and month refer to the version of ISO/IEC 10646 and its amendments that have been implemented. For example, __STDC_ISO_10646__ == 200009L if the implementation covers ISO/IEC 10646-1:2000.
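    A program can test for this macro at compile time. The sketch below wraps the check in a function; the name ucs_version is an assumption made for this example, and the value it returns depends entirely on the compiler and C library in use.

```c
#include <wchar.h>

/* Return the yyyymm value of __STDC_ISO_10646__ if the implementation
 * promises that wchar_t holds ISO 10646 values in all locales, or 0 if
 * it makes no such promise.
 * (Hypothetical helper, for illustration only.) */
long ucs_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

    Code that needs wchar_t to contain UCS values can refuse to build, or fall back to its own conversion tables, when this function would return 0.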

    How should Unicode be used under Linux?

    Before UTF-8 emerged, Linux users all over the world had to use various different language-specific extensions of ASCII. Most popular were ISO 8859-1 and ISO 8859-2 in Europe, ISO 8859-7 in Greece, KOI-8 / ISO 8859-5 / CP1251 in Russia, EUC and Shift-JIS in Japan, BIG5 in Taiwan, etc. This made the exchange of files difficult and application software had to worry about various small differences between these encodings. Support for these encodings was usually incomplete, untested, and unsatisfactory, because the application developers rarely used all these encodings themselves.

    Because of these difficulties, major Linux distributors and application developers are now phasing out these older legacy encodings in favour of UTF-8. UTF-8 support has improved dramatically over the last few years and many people now use UTF-8 on a daily basis in

    • text files (source code, HTML files, email messages, etc.)
    • file names
    • standard input and standard output, pipes
    • environment variables
    • cut and paste selection buffers
    • telnet, modem, and serial port connections to terminal emulators

    and in any other places where byte sequences used to be interpreted in ASCII.

    In UTF-8 mode, terminal emulators such as xterm or the Linux console driver transform every keystroke into the corresponding UTF-8 sequence and send it to the stdin of the foreground process. Similarly, any output of a process on stdout is sent to the terminal emulator, where it is processed with a UTF-8 decoder and then displayed using a 16-bit font.

    Full Unicode functionality with all bells and whistles (e.g. high-quality typesetting of the Arabic and Indic scripts) can only be expected from sophisticated multi-lingual word-processing packages. What Linux supports today on a broad base is far simpler and mainly aimed at replacing the old 8- and 16-bit character sets. Linux terminal emulators and command line tools usually only support a Level 1 implementation of ISO 10646-1 (no combining characters), and only scripts such as Latin, Greek, Cyrillic, Armenian, Georgian, CJK, and many scientific symbols are supported that need no further processing support. At this level, UCS support is very comparable to ISO 8859 support and the only significant difference is that we have now thousands of different characters available, that characters can be represented by multibyte sequences, and that ideographic Chinese/Japanese/Korean characters require two terminal character positions (double-width).

    Level 2 support in the form of combining characters for selected scripts (in particular Thai) and Hangul Jamo is in parts also available (i.e., some fonts, terminal emulators and editors support it via simple overstriking), but precomposed characters should be preferred over combining character sequences where available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15.

    One influential non-POSIX PC operating system vendor (whom we shall leave unnamed here) suggested that all Unicode files should start with the character ZERO WIDTH NO-BREAK SPACE (U+FEFF), which is in this role also referred to as the “signature” or “byte-order mark (BOM)”, in order to identify the encoding and byte-order used in a file. Linux/Unix does not use any BOMs and signatures. They would break far too many existing ASCII syntax conventions (such as scripts starting with #!). On POSIX systems, the selected locale identifies already the encoding expected in all input and output files of a process. It has also been suggested to call UTF-8 files without a signature “UTF-8N” files, but this non-standard term is usually not used in the POSIX world.

    Before you switch to UTF-8 under Linux, update your installation to a recent distribution with up-to-date UTF-8 support. This is particularly the case if you use an installation older than SuSE 9.1 or Red Hat 8.0. Before these, UTF-8 support was not yet mature enough to be recommendable for daily use.

    Red Hat Linux 8.0 (September 2002) was the first distribution to take the leap of switching to UTF-8 as the default encoding for most locales. The only exceptions were Chinese/Japanese/Korean locales, for which there were at the time still too many specialized tools available that did not yet support UTF-8. This first mass deployment of UTF-8 under Linux caused most remaining issues to be ironed out rather quickly during 2003. SuSE Linux then switched its default locales to UTF-8 as well, as of version 9.1 (May 2004). It was followed by Ubuntu Linux, the first Debian-derivative that switched to UTF-8 as the system-wide default encoding. With the migration of the three most popular Linux distributions, UTF-8 related bugs have now been fixed in practically all well-maintained Linux tools. Other distributions can be expected to follow soon.

    How do I have to modify my software?

    If you are a developer, there are several approaches to add UTF-8 support. We can split them into two categories, which I will call soft and hard conversion. In soft conversion, data is kept in its UTF-8 form everywhere and only very few software changes are necessary. In hard conversion, any UTF-8 data that the program reads will be converted into wide-character arrays and will be handled as such everywhere inside the application. Strings will only be converted back to UTF-8 at output time. Internally, a character remains a fixed-size memory object.

    We can also distinguish hard-wired and locale-dependent approaches for supporting UTF-8, depending on how much the string processing relies on the standard library. C offers a number of string processing functions designed to handle arbitrary locale-specific multibyte encodings. An application programmer who relies entirely on these can remain unaware of the actual details of the UTF-8 encoding. Chances are then that by merely changing the locale setting, several other multi-byte encodings (such as EUC) will automatically be supported as well. The other way a programmer can go is to hardcode knowledge about UTF-8 into the application. This may lead in some situations to significant performance improvements. It may be the best approach for applications that will only be used with ASCII and UTF-8.

    Even where support for every multi-byte encoding supported by libc is desired, it may well be worth adding extra code optimized for UTF-8. Thanks to UTF-8’s self-synchronizing features, it can be processed very efficiently. The locale-dependent libc string functions can be two orders of magnitude slower than equivalent hardwired UTF-8 procedures. A bad teaching example was GNU grep 2.5.1, which relied entirely on locale-dependent libc functions such as mbrlen() for its generic multi-byte encoding support. This made it about 100× slower in multibyte mode than in single-byte mode! Other applications with hardwired support for UTF-8 regular expressions (e.g., Perl 5.8) do not suffer this dramatic slowdown.

    Most applications can do fine with just soft conversion. This is what makes the introduction of UTF-8 on Unix feasible at all. To name two trivial examples, programs such as cat and echo do not have to be modified at all. They can remain completely ignorant as to whether their input and output is ISO 8859-2 or UTF-8, because they handle just byte streams without processing them. They only recognize ASCII characters and control codes such as '\n' which do not change in any way under UTF-8. Therefore the UTF-8 encoding and decoding is done for these applications completely in the terminal emulator.

    A small modification will be necessary for any program that determines the number of characters in a string by counting the bytes. With UTF-8, as with other multi-byte encodings, where the length of a text string is of concern, programmers have to distinguish clearly between

    1. the number of bytes,
    2. the number of characters,
    3. the display width (e.g., the number of cursor position cells in a VT100 terminal emulator)

    of a string.

    C’s strlen(s) function always counts the number of bytes. This is the number relevant, for example, for memory management (determination of string buffer sizes). Where the output of strlen is used for such purposes, no change will be necessary.

    The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
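    The hard-wired counting technique mentioned above fits in a few lines. This sketch assumes that its input is valid, zero-terminated UTF-8; the name utf8_char_count is made up for this example.

```c
#include <stddef.h>

/* Count the characters in a zero-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (0x80-0xBF).
 * (Hypothetical helper, for illustration only.) */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* Continuation bytes have the bit pattern 10xxxxxx. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

    For the three-byte string "\xE2\x89\xA0" (U+2260), strlen returns 3 while utf8_char_count returns 1.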

    In applications written for ASCII or ISO 8859, a far more common use of strlen is to predict the number of columns that the cursor of the terminal will advance if a string is printed. With UTF-8, neither a byte nor a character count will predict the display width, because ideographic characters (Chinese, Japanese, Korean) will occupy two column positions, whereas control and combining characters occupy none. To determine the width of a string on the terminal screen, it is necessary to decode the UTF-8 sequence and then use the wcwidth function to test the display width of each character, or wcswidth to measure the entire string.
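    A locale-dependent way to measure the display width, using the functions named above, might look like the sketch below. The name display_width is an assumption made for this example, and the result is only meaningful after the program has selected a suitable locale with setlocale(LC_CTYPE, "").

```c
#define _XOPEN_SOURCE 700   /* wcswidth() is an X/Open function */
#include <stdlib.h>
#include <wchar.h>

/* Return the number of terminal columns that the multibyte string s
 * occupies in the current locale, or -1 if s contains an invalid
 * multibyte sequence or a non-printable character, or if memory runs
 * out. (Hypothetical helper, for illustration only.) */
int display_width(const char *s)
{
    size_t n = mbstowcs(NULL, s, 0);   /* number of wide characters needed */
    wchar_t *w;
    int width;

    if (n == (size_t)-1) return -1;    /* invalid multibyte sequence */
    w = malloc((n + 1) * sizeof *w);
    if (w == NULL) return -1;
    mbstowcs(w, s, n + 1);             /* decode, including the terminator */
    width = wcswidth(w, n);            /* -1 if a non-printable character occurs */
    free(w);
    return width;
}
```

    A program such as ls would call this once per filename and pad each column with the difference between the column width and the returned value.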

    For instance, the ls program had to be modified, because without knowing the column widths of filenames, it cannot format the table layout in which it presents directories to the user. Similarly, all programs that assume somehow that the output is presented in a fixed-width font and format it accordingly have to learn how to count columns in UTF-8 text. Editor functions such as deleting a single character have to be slightly modified to delete all bytes that might belong to one character. Affected were for instance editors (vi, emacs, readline, etc.) as well as programs that use the ncurses library.
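    For the character-deletion case mentioned above, an editor has to step back over all continuation bytes until it reaches the first byte of the character. A minimal sketch, assuming valid UTF-8 in the buffer, with utf8_prev_char as a made-up name:

```c
#include <stddef.h>

/* Given a UTF-8 buffer s and a byte offset pos, return the offset at
 * which the character ending just before pos starts. Returns 0 when
 * pos is already 0. (Hypothetical helper, for illustration only.) */
size_t utf8_prev_char(const char *s, size_t pos)
{
    if (pos == 0) return 0;
    pos--;
    /* Skip continuation bytes (10xxxxxx) until a lead or ASCII byte. */
    while (pos > 0 && ((unsigned char)s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

    A backspace implementation would then delete the bytes from utf8_prev_char(s, cursor) up to cursor. This handles one code point only; a base character followed by combining characters would need further steps.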

    Any Unix-style kernel can do fine with soft conversion and needs only very minor modifications to fully support UTF-8. Most kernel functions that handle strings (e.g. file names, environment variables, etc.) are not affected at all by the encoding. Modifications were necessary in Linux in the following places:

    • The console display and keyboard driver (another VT100 emulator) have to encode and decode UTF-8 and should support at least some subset of the Unicode character set. This had already been available in Linux as early as kernel 1.2 (send ESC %G to the console to activate UTF-8 mode).
    • External file system drivers such as VFAT and WinNT have to convert file name character encodings. UTF-8 is one of the available conversion options, and the mount command has to tell the kernel driver that user processes shall see UTF-8 file names. Since VFAT and WinNT use already Unicode anyway, UTF-8 is the only available encoding that guarantees a lossless conversion here.
    • The tty driver of any POSIX system supports a “cooked” mode, in which some primitive line editing functionality is available. In order to allow the character-erase function (which is activated when you press backspace) to work properly with UTF-8, someone needs to tell it not to count continuation bytes in the range 0x80-0xBF as characters, but to delete them as part of a UTF-8 multi-byte sequence. Since the kernel is ignorant of the libc locale mechanics, another mechanism is needed to tell the tty driver about UTF-8 being used. Linux kernel versions 2.6 or newer support a bit IUTF8 in the c_iflag member variable of struct termios. If it is set, the “cooked” mode line editor will treat UTF-8 multi-byte sequences correctly. This mode can be set from the command shell with “stty iutf8”. Xterm and friends should set this bit automatically when called in a UTF-8 locale.
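    From C, the IUTF8 bit described in the last item can be set through the standard termios calls. The following is a sketch only: the function name enable_iutf8 is made up for this example, and since IUTF8 is not part of POSIX, the code is guarded so that it still compiles on systems that do not define the flag.

```c
#include <termios.h>

/* Set the IUTF8 input flag on the terminal referred to by fd.
 * Returns 0 on success and -1 if fd is not a terminal, the call
 * fails, or the platform does not define IUTF8.
 * (Hypothetical helper, for illustration only.) */
int enable_iutf8(int fd)
{
#ifdef IUTF8
    struct termios t;
    if (tcgetattr(fd, &t) != 0)
        return -1;                    /* not a terminal, or bad descriptor */
    t.c_iflag |= IUTF8;               /* same effect as "stty iutf8" */
    return tcsetattr(fd, TCSANOW, &t) == 0 ? 0 : -1;
#else
    (void)fd;
    return -1;                        /* no IUTF8 flag on this platform */
#endif
}
```

    A terminal emulator would typically call this on the descriptor of its pseudo-terminal when it starts in a UTF-8 locale.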

    C support for Unicode and UTF-8

    Starting with GNU glibc 2.2, the type wchar_t is officially intended to be used only for 32-bit ISO 10646 values, independent of the currently used locale. This is signalled to applications by the definition of the __STDC_ISO_10646__ macro as required by ISO C99. The ISO C multi-byte conversion functions (mbsrtowcs(), wcsrtombs(), etc.) are fully implemented in glibc 2.2 or higher and can be used to convert between wchar_t and any locale-dependent multibyte encoding, including UTF-8, ISO 8859-1, etc.

    For example, you can write

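    #include <stdio.h>
    #include <locale.h>

    int main(void)
    {
        /* Adopt the locale selected by the environment (LC_ALL, LC_CTYPE, LANG). */
        if (!setlocale(LC_CTYPE, "")) {
            fprintf(stderr, "Can't set the specified locale! "
                    "Check LANG, LC_CTYPE, LC_ALL.\n");
            return 1;
        }
        /* %ls converts the wide string to the locale's multibyte encoding. */
        printf("%ls\n", L"Schöne Grüße");
        return 0;
    }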

    Call this program with the locale setting LANG=de_DE and the output will be in ISO 8859-1. Call it with LANG=de_DE.UTF-8 and the output will be in UTF-8. The %ls format specifier in printf calls wcsrtombs in order to convert the wide character argument string into the locale-dependent multi-byte encoding.

    Many of C’s string functions are locale-independent and they just look at zero-terminated byte sequences:

      strcpy strncpy strcat strncat strcmp strncmp strdup strchr strrchr
      strcspn strspn strpbrk strstr strtok

    Some of these (e.g. strcpy) can equally be used for single-byte (ISO 8859-1) and multi-byte (UTF-8) encoded character sets, as they need no notion of how many bytes long a character is, while others (e.g., strchr) depend on one character being encoded in a single char value and are of less use for UTF-8 (strchr still works fine if you just search for an ASCII character in a UTF-8 string).

    Other C functions are locale dependent and work in UTF-8 locales just as well:

      strcoll strxfrm

    How should the UTF-8 mode be activated?

    If your application is soft converted and does not use the standard locale-dependent C multibyte routines (mbsrtowcs(), wcsrtombs(), etc.) to convert everything into wchar_t for processing, then it might have to find out in some way, whether it is supposed to assume that the text data it handles is in some 8-bit encoding (like ISO 8859-1, where 1 byte = 1 character) or UTF-8. Once everyone uses only UTF-8, you can just make it the default, but until then both the classical 8-bit sets and UTF-8 may still have to be supported.

    The first wave of applications with UTF-8 support used a whole lot of different command line switches to activate their respective UTF-8 modes, for instance the famous xterm -u8. That turned out to be a very bad idea. Having to remember a special command line option or other configuration mechanism for every application is very tedious, which is why command line options are not the proper way of activating a UTF-8 mode.

    The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that contains information about culture-specific conventions of software behaviour, including the character encoding, the date/time notation, alphabetic sorting rules, the measurement system and common office paper size, etc. The names of locales usually consist of ISO 639-1 language and ISO 3166-1 country codes, sometimes with additional encoding names or other qualifiers.

    You can get a list of all locales installed on your system (usually in /usr/lib/locale/) with the command locale -a. Set the environment variable LANG to the name of your preferred locale. When a C program executes the setlocale(LC_CTYPE, "") function, the library will test the environment variables LC_ALL, LC_CTYPE, and LANG in that order, and the first one of these that has a value will determine which locale data is loaded for the LC_CTYPE category (which controls the multibyte conversion functions). The locale data is split up into separate categories. For example, LC_CTYPE defines the character encoding and LC_COLLATE defines the string sorting order. The LANG environment variable is used to set the default locale for all categories, but the LC_* variables can be used to override individual categories. Do not worry too much about the country identifiers in the locales. Locales such as en_GB (English in Great Britain) and en_AU (English in Australia) differ usually only in the LC_MONETARY category (name of currency, rules for printing monetary amounts), which practically no Linux application ever uses. LC_CTYPE=en_GB and LC_CTYPE=en_AU have exactly the same effect.
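    The lookup order described above can be mimicked with getenv. The sketch below only illustrates the precedence rule and does not replace setlocale; the name effective_ctype_locale is made up for this example, an empty variable is treated as having no value, and returning "C" when nothing is set is an assumption about the usual default.

```c
#include <stdlib.h>

/* Return the locale name that setlocale(LC_CTYPE, "") would be
 * expected to pick: the first of LC_ALL, LC_CTYPE and LANG that has a
 * non-empty value, or "C" if none of them is set.
 * (Hypothetical helper, for illustration only.) */
const char *effective_ctype_locale(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v != NULL && v[0] != '\0')
            return v;                 /* first variable with a value wins */
    }
    return "C";                       /* assumed default when nothing is set */
}
```

    With LANG=de_DE and LC_CTYPE=en_GB.UTF-8 set and LC_ALL unset, the function returns "en_GB.UTF-8", because LC_CTYPE overrides LANG for this category.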

    You can query the name of the%

