Internet Draft                                            R. L. Ullmann
                                           Process Software Corporation
                                                          June 12, 1992


                 NET-UTF: International character set


1 Status of this Memo

This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. (Note that other groups may also distribute
working documents as Internet Drafts).

Internet Drafts are draft documents valid for a maximum of six months.
Internet Drafts may be updated, replaced, or obsoleted by other
documents at any time. It is not appropriate to use Internet Drafts as
reference material or to cite them other than as a "working draft" or
"work in progress."

Please check the I-D abstract listing contained in each Internet Draft
directory to learn the current status of this or any other Internet
Draft.

This draft expires on or before December 12, 1992.

2 Introduction

The Internet is no longer a creature of the United States, much less
of DARPA (the US Defense Advanced Research Projects Agency). It is now
an international network, and the ability to communicate in any of the
world languages on an equal footing is an imperative.

This draft attempts to track the development of ISO 10646 [2], a
moving target at this writing. The reference citation below is to the
2nd 10646 draft, for which balloting has just concluded (June 1992).
This memo may therefore change until IS 10646 is published. Some of
the following text refers to 10646 in the present tense, as if it were
already an IS; it should be understood in this context.

3 Motivation

A working group (JTC1/SC2/WG2) of the ISO is currently working on
specification of a 32-bit Universal Character Set (UCS), ISO 10646
[2], including as annex F the specification of a Universal Transfer
Format (UTF) algorithm that would convert the 32-bit codes into 1-5
octet sequences. The UTF code is deliberately designed to be usable
with existing software.

This document is intended to facilitate the earliest possible use of
IS-10646-UTF as the new universal code for text within the Internet,
referring to it as NET-UTF, by analogy with NET-ASCII, the Internet
standard for the use of ASCII-7 within the Internet.

NET-UTF is upward-compatible with NET-ASCII. Since "upward/downward
compatible" has been much abused, a more precise definition is in
order:

1.  A document in 7 bit NET-ASCII can be taken to be in NET-UTF
    without altering its interpretation; the NET-UTF representation of
    the document is bit-for-bit identical to NET-ASCII.

2.  A document in NET-UTF that consists only of characters
    representable by NET-ASCII, either by design or happenstance, is
    identical to the same document in NET-ASCII.

This document is itself in NET-UTF.

4 Terminology: Octets and Characters

Since NET-UTF is a multi-octet character set, an important distinction
must be drawn between the term "octet" and the term "character".

1.  An octet is an 8-bit datum, which may contain values 0 through 255
    decimal.

2.  A character is a conceptual entity, such as "A" or "ö" (o with 2
    dots over it). Its coded representation is not part of the
    definition of "character".

3.  A coded character set is a set of unambiguous rules that establish
    a set of characters and the relationship between the characters of
    the set and their coded representation.

Note that the same character coded in different character sets may
result in different octets. For example, take the character ö
(o-diaeresis), code point 246. decimal in UCS: in the variant of the
IS 646 character set used for (e.g.) Swedish it is the octet 124., in
IS 8859-1 it is the octet 246., and in UTF it is the 2 octets 160.
246.

Particular attention must be paid to this distinction when reading
RFCs that predate 10646 (i.e. all of them, at this writing ...).
The RFCs often say "character" when the intention is "octet", assuming
an equivalence that, while valid at the time, is no longer valid. This
can lead directly to fruitless argument over Original Intent. It is
more important to determine which definition is more useful in any
given case, and refine the definition of the protocol(s) described
appropriately.

In almost all cases in the existing RFCs, "character" should be read
as "octet", and specific "character X" as "octet containing the value
assigned in ASCII-7 to the character X".

4 Description of IS-10646-UTF

Important Note: This section and the next are not to be taken as in
any way authoritative. Only the IS itself is authoritative. This
description is provided only for informal reference and exposition.

IS 10646 defines (will define) a 32 bit set, UCS-4, with characters
assigned to integer code points in the range 0. to 2147483647. (Note
that the high bit of the first octet is never set; codes with it set
may be used for any internal purpose within a device, but may not
appear in external text conforming to the standard.) The first 256
code points are IS 8859-1 (aka "ASCII-8").

For example, LATIN CAPITAL LETTER A is 65., or 00 00 00 41 in hex.
This is the same integer code point in 7 bit ASCII, in 8859-1, and in
UCS-4. In UCS-4 it contains octets (00 in this case) that "look like"
control codes to software not recognizing multi-octet codes.

The standard defines (now in second working draft) a Universal
Transfer Format, to address this problem. Codes are mapped through an
algorithm to a 1-5 octet sequence. Ranges of the UCS code space are
mapped to ranges of UTF as follows:

   UCS-4 (decimal codes)          UTF (hexadecimal octets)

           0. to        159.      00 to 9F
         160. to        255.      A0 A0 to A0 FF
         256. to      16405.      A1 21 to F5 FF
       16406. to     233005.      F6 21 21 to FB FF FF
      233006. to 4294967295.      FC 21 21 21 21 to FF 59 3C C8 C3

The octets used in the multi-octet characters after the first are
always in the range(s) 21-7E and A0-FF, and therefore do not look like
control codes to software unaware that it is transmitting a
multi-octet code set.

There are no shifts or locking shifts, a major technical advantage
over the previous draft of 10646. Any control character (including,
e.g., SPACE) thus provides a resynchronization point if an error
occurs.

UTF also provides an advantage of compactness in most cases,
especially when small amounts of text in various languages are
intermixed, with the majority in a Latin language. (One might say
"English", but that might lead to complaints of parochialism. So one
doesn't.)

It also facilitates further compression with general purpose text
compression techniques, since the most useful statistics are found in
the tri-octet range, exactly where they are in NET-ASCII text, almost
regardless of the language used. (The word "tri-octet" has not
appeared in print before to this author's knowledge, but "trigraph",
the usual term used in cryptography and data compression research,
would be a misnomer here.)

5 Outline of code table

The following is an outline of the (DRAFT) 10646 code table. As with
the information in the previous section, this may change up to the
issuance of the IS, and is not authoritative; refer to the IS.

Each block is described by its name in the (DRAFT) standard (CJK means
Chinese-Japanese-Korean), the first and last code points of the block
in decimal, and the UTF code for each of those points in hexadecimal.
The first and last points may not (and in many cases do not)
correspond to assigned characters.

Block                                   First        UTF             Last         UTF
                                        (decimal)    (hex)           (decimal)    (hex)

ISO-646 IRV                             32.          20              126.         7E
Latin-1 Supplement                      160.         A0 A0           255.         A0 FF
Extended Latin-A                        256.         A1 21           383.         A1 C1
Extended Latin-B                        384.         A1 C2           591.         A2 D3
IPA Extensions                          592.         A2 D4           687.         A3 54
Spacing modifier letters                688.         A3 55           767.         A3 C5
Combining diacritical marks             768.         A3 C6           879.         A4 56
Greek                                   880.         A4 57           1023.        A5 28
Cyrillic                                1024.        A5 29           1279.        A6 6A
Armenian                                1328.        A6 BC           1423.        A7 3C
Hebrew                                  1424.        A7 3D           1535.        A7 CD
Arabic                                  1536.        A7 CE           1791.        A9 30
Devanagari                              2304.        AB D6           2431.        AC 76
Bengali                                 2432.        AC 77           2559.        AD 38
Gurmukhi                                2560.        AD 39           2687.        AD D9
Gujarati                                2688.        AD DA           2815.        AE 7A
Oriya                                   2816.        AE 7B           2943.        AF 3C
Tamil                                   2944.        AF 3D           3071.        AF DD
Telugu                                  3072.        AF DE           3199.        B0 7E
Kannada                                 3200.        B0 A0           3327.        B1 40
Malayalam                               3328.        B1 41           3455.        B1 E1
Thai                                    3584.        B2 A4           3711.        B3 44
Lao                                     3712.        B3 45           3839.        B3 E5
Tibetan                                 4096.        B5 49           4191.        B5 C9
Georgian                                4256.        B6 2B           4351.        B6 AB
Additional Extended Latin               7680.        C8 2F           7935.        C9 70
Greek Extensions                        7936.        C9 71           8191.        CA D3
General Punctuation                     8192.        CA D4           8303.        CB 64
Superscripts and Subscripts             8304.        CB 65           8351.        CB B5
Currency Symbols                        8352.        CB B6           8399.        CB E5
Combining Diacritical Marks For Symbols 8400.        CB E6           8447.        CC 36
Letterlike Symbols                      8448.        CC 37           8527.        CC A7
Number Forms                            8528.        CC A8           8591.        CC E7
Arrows                                  8592.        CC E8           8703.        CD 78
Mathematical Operators                  8704.        CD 79           8959.        CE DB
Miscellaneous Technical                 8960.        CE DC           9215.        D0 3E
Control Pictures                        9216.        D0 3F           9279.        D0 7E
Optical Character Recognition           9280.        D0 A0           9311.        D0 BF
Enclosed Alphanumerics                  9312.        D0 C0           9471.        D1 A1
Box Drawing                             9472.        D1 A2           9599.        D2 42
Block Elements                          9600.        D2 43           9631.        D2 62
Geometric Shapes                        9632.        D2 63           9727.        D2 E3
Miscellaneous Dingbats                  9728.        D2 E4           9983.        D4 46
Dingbats                                9984.        D4 47           10175.       D5 48
CJK Symbols and Punctuation             12288.       E0 5F           12351.       E0 BF
Hiragana                                12352.       E0 C0           12447.       E1 40
Katakana                                12448.       E1 41           12543.       E1 C1
Bopomofo                                12544.       E1 C2           12591.       E1 F1
Hangul Jamo                             12592.       E1 F2           12687.       E2 72
CJK Miscellaneous                       12688.       E2 73           12703.       E2 A3
Combining Hangul Jamo                   12704.       E2 A4           12799.       E3 24
Enclosed CJK Letters and Months         12800.       E3 25           13055.       E4 66
CJK Compatibility Words and Hours       13056.       E4 67           13183.       E5 28
Hangul                                  13312.       E5 CA           15663.       F2 32
Supplementary Hangul                    15872.       F3 45           17807.       F6 28 68
Old Hangul                              17920.       F6 28 FA        19599.       F6 31 DB
CJK Unified Ideograms                   19968.       F6 33 D0        40959.       F6 C3 4C
Private Use Area                        57344.       F7 3A 79        63487.       F7 5A D9
CJK Compatibility Ideographs            63744.       F7 5C 3D        64255.       F7 5E E1
Alphabetic Presentation Forms           64256.       F7 5E E2        64335.       F7 5F 52
Arabic Presentation Forms               64592.       F7 60 B6        65023.       F7 62 E9
CJK Compatibility Forms                 65072.       F7 63 3B        65103.       F7 63 5A
Small Form Variants                     65104.       F7 63 5B        65135.       F7 63 7A
Arabic Presentation Forms-B             65136.       F7 63 7B        65278.       F7 64 4B
Halfwidth and Fullwidth Forms           65280.       F7 64 4D        65519.       F7 65 7E
Specials                                65520.       F7 65 A0        65533.       F7 65 AD
Private Use Planes                      14680064.    FC 23 35 46 3D  16777215.    FC 23 6F 57 D7
Private Use Groups                      1610612736.  FD 4D D6 E4 D8  2147483647.  FD BD 2B B9 40

6 Notes on particular Internet protocols

Most of the common Internet application protocols, as implemented by
commercial software, already provide for the use of 8 bit characters,
usually to permit use of IS 8859 variants for alphabetic languages.
Public domain software is often lacking in this area, not having been
subjected to international commercial pressures, and usually being
hacked by the user to handle character sets other than 7 bit ASCII
where necessary. Which is almost everywhere: the only two languages
that can be written properly with ASCII-7 are Hawaiian and Swahili
(when written in the Latin script; the Arabic script is also used).
English cannot be (Rôle, cliché, coöperate, façade; although the
spelling variants without the diacritics are considered acceptable),
and one wants to spell proper names properly: Ångström, Mötley Crüe.
(In that last case, one concludes that the spelling was chosen for
appearance; the implied pronunciation is awesome.)

Most commercial implementations also permit either CRLF or LF as a
line terminator, even when standards dictate CRLF. It is perhaps
useful to move toward the simple use of LF as line terminator in
NET-UTF. This would be consistent with the de facto text standard in
NFS, derived from Unix.

The following sections are simply observations on the major
application protocols, without any attempt to be comprehensive. (IHR
is the Internet Host Requirements, RFC 1123).

6.1 TELNET

TELNET (the Internet remote login protocol) has very specific
requirements for the use of CR and LF, best explained in IHR, 3.3.1.
It also (IHR, 3.4.1) specifies character set transparency, at least
for 7 bit characters.

Most implementations actually already provide 8 bit transparency,
whether "binary mode" is negotiated or not. If the binary mode is
negotiated, this serves to turn off newline interpretation and other
control interpretation (if any), not to enable 8 bit transmission.

While the default assumption should now be NET-UTF, the actual set
used may be entirely a private issue; note that TELNET servers and
clients may have (e.g.) knowledge of terminal types.

6.2 FTP

FTP (the Internet file transfer protocol) uses two separate session
connections: one for issuing commands and responses, the other
(re-)established for each file transferred, and carrying only data.

6.2.1 Control connection

NET-UTF is only to be used in path names of files to be transferred.
This is an extension of the specifications of RFC 959 and IHR 4.1.4.1,
which specify any 7 bit character other than CR and LF. Note that (for
example) the BSD Unix file system allows any 8 bit character
(nominally IS 8859-1), and the FTP implementation permits these names
to be used.

6.2.2 Data connection

When the data is being transferred in text mode, most existing
implementations permit 8 bit characters, and accept either LF or CRLF
as line terminators.

6.3 SMTP

6.3.1 Protocol (RFC 821)

In SMTP (the Internet mail transfer protocol) itself (as distinguished
from the message header) NET-UTF does not make any real difference,
since electronic mail addresses consist of a restricted set of
characters. The other parts of the command syntax are entirely
keywords and fixed elements.

NET-UTF might be used in the text of responses (which are not
interpreted by the protocol), particularly when giving a multi-lingual
response.

The only issue of actual protocol failure in data transmission might
occur when an octet of value (hex) AE is the only content of a line;
if the data is being "stripped" to 7 bit (i.e. by software that is not
NET-UTF compliant) this might look like the dot (hex 2E) used to end
the message.

This is not considered a problem, since AE is never used by itself in
IS 10646-UTF. It is always either the first octet of a 2 octet
character, or a subsequent octet in a multi-octet character; the
problem does not arise, except perhaps by intentional mischief.

No negotiation of 8 bit transmission is done; this would simply
introduce a presently non-existent failure mode.
Communities of users that need to use 8 bit character sets already are
using the protocol with 8 bit transmission.

More importantly: Internet mail transfer agents do not now have
license to modify the content of messages in any way (although public
domain software often does, to the detriment of everyone); it would be
a serious regression to allow any such license.

6.3.2 Internet message headers (RFC 822)

The use of NET-UTF in message headers is effectively already
implemented, with the notable exception of the large quantity of
software that does not even attempt to comply with the existing
standard, to which no concession need be made. (The existing standard
being a decade old at this writing.)

Where message header fields contain arbitrary text, either there are
no restrictions (e.g. the subject field) or a well defined combination
of single "character" and string quoting is used. Present
implementations consider this to be octet-level quoting (i.e. given
that there has been no distinction between "octet" and "character"),
and this interpretation is used, reading "octet" for "character" in
the specification.

The de-escaped field content can then be interpreted as a NET-UTF
string, to be rendered as any other text.

6.3.3 SMTP/X.25 (RFC 1090)

The specification for the use of SMTP directly on X.25 and the packet
mode of the ISDN should now refer to IS 10646-UTF instead of IS
8859-1.

6.4 NFS

The network file system does not normally place any interpretation on
the content of files when used in a Unix-only environment, but
implementations on other operating systems must often do some
interpretation or conversion of "text" files.

Taking all the existing 7 bit ASCII files to be NET-UTF is a powerful
extension of the present day environment, and should provide an
immediately effective transition to a universally useful network data
base.
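
The claim that existing 7 bit ASCII text is already valid NET-UTF can
be illustrated with a short sketch. The Python below is not taken from
the IS and is not authoritative; it assumes a base-190 arithmetic that
reproduces the range table in section 4 and the block boundaries in
section 5. The names utf_encode, trail and BASE are hypothetical,
chosen only for this illustration.

```python
# Sketch only: an assumed arithmetic that reproduces the UCS-to-UTF
# ranges tabulated in this memo. The IS itself is authoritative.

BASE = 190  # octet values allowed after the first octet: 21-7E and A0-FF


def trail(z: int) -> int:
    """Map a digit 0..189 onto the octet ranges 21-7E and A0-FF."""
    return z + 0x21 if z < 0x5E else z + 0x42


def utf_encode(code: int) -> bytes:
    """Encode one UCS code point as a 1-5 octet sequence."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code point outside the range 0. to 2147483647.")
    if code < 0xA0:              # 0. to 159.: a single octet, unchanged
        return bytes([code])
    if code < 0x100:             # 160. to 255.: A0 followed by the octet
        return bytes([0xA0, code])
    if code < 0x4016:            # 256. to 16405.: two octets
        lead, offset, ntrail = 0xA1, 0x100, 1
    elif code < 0x38E2E:         # 16406. to 233005.: three octets
        lead, offset, ntrail = 0xF6, 0x4016, 2
    else:                        # 233006. and up: five octets
        lead, offset, ntrail = 0xFC, 0x38E2E, 4
    y = code - offset
    digits = []
    for _ in range(ntrail):      # peel off base-190 digits, lowest first
        y, d = divmod(y, BASE)
        digits.append(trail(d))
    return bytes([lead + y] + digits[::-1])


# ASCII text passes through unchanged, octet for octet.
sample = "NET-ASCII text"
assert b"".join(utf_encode(ord(c)) for c in sample) == sample.encode("ascii")

# Boundary values taken from the range table in section 4.
assert utf_encode(16405) == bytes.fromhex("F5FF")
assert utf_encode(16406) == bytes.fromhex("F62121")
```

Because every code point below 160. maps to itself as a single octet,
a file containing only 7 bit ASCII needs no conversion at all, which
is the property the paragraph above relies on.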

7 References

[1]  David H. Crocker. Standard for the Format of ARPA Internet Text
     Messages. RFC 822, University of Delaware, August 1982.

[2]  International Organization for Standardization. Information
     technology -- Universal Coded Character Set (UCS). ISO/IEC DIS
     10646-1.2, ISO, 26 December 1991. (Draft, ballot just concluded
     at this writing)

[3]  Jon Postel. Simple Mail Transfer Protocol. RFC 821, USC
     Information Sciences Institute, August 1982.

[4]  Robert L. Ullmann. SMTP on X.25. RFC 1090, Prime Computer,
     February 1989.

8 Author's Address

Robert Ullmann
Process Software Corporation
959 Concord Street
Framingham, MA 01701
USA

Phone: +1 508 879 6994 x226
Email: Ariel@Process.COM

This draft expires on or before December 12, 1992.