Network Working Group                            D. Robinson, R. Ullmann
Internet Draft                                      Prime Computer, Inc.
                                                            October 1991


              International character support in SMTP

1. Status of this Memo

This memo describes an update to the SMTP protocol, and to the format of an Internet mail message, to provide support for the character sets used in the world's many languages. This memo documents existing usage as well as specifying some additional interoperability refinements. It updates RFCs 821, 822, and 1090 [4, 1, 6].

The Internet is no longer a creature of the United States, much less of DARPA (the US Defense Advanced Research Projects Agency). It is now an international network, and the ability to communicate in any of the world's languages on an equal footing is an imperative.

This draft attempts to track the development of ISO 10646 [3], a moving target at this writing. The reference citation below is to the previous 10646 draft, which failed in the balloting in June of 1991. It is therefore expected that this memo may change until the publication of IS 10646. Some of the following text refers to 10646 in the present tense, as if it were already an International Standard (IS); it should be understood in this context.

Distribution of this memo is unlimited.

2. Introduction

SMTP has been defined over the TCP as a 7 bit text protocol. While there is some dispute as to whether this is actually a restriction, the question is now moot: Internet mailers are required to pass the 8th bit when relaying Internet mail on the TCP. It is understood that some time will pass before all Internet Mail Transfer Agents (MTAs) can be expected to comply with the new requirement. This explicitly modifies the provisions of RFC 821 [4].

In addition, a conceptual basis and a specific header field are described for designating the character set(s) used, when that character set is not ASCII-7 or AUC (see below); this provides documentation of character sets being used presently in the Internet, in the absence of the ISO universal set standard.

Note: This specification provides 8 bit text. It does NOT provide "transparent" binary; in particular, the mail message is still represented at the presentation layer (in the ISO model) as an ordered set of text lines, of limited length.

This specification is written specifically for Internet mail transfer agents, i.e. those operating as part of the Internet. It may not be directly applicable to the mail transfer agents of other networks; in the case of gateways to the Internet, it applies only to the interface presented to the Internet, not necessarily to the other network.

Likewise, a host receiving an SMTP mail message for final delivery is subject to this specification only in that it should interpret the incoming message as being in AUC, except where explicitly declared otherwise as provided below.

3. Motivation

RFCs 821 and 822 defined a mail message as a sequence of lines of 7-bit ASCII characters. At the time (1982), ASCII-7 constituted a reasonable "lingua franca" for mail messages.

Much has changed. Today, SMTP mail is in use by a number of language groups in character sets other than ASCII-7. A number of European-language systems send mail in 7-bit ECMA-35 character sets [2] in which specific ASCII-7 characters have been replaced with local non-ASCII-7 characters. ASCII itself has been redefined as an 8 bit set (ASCII-8), with code points identical to the 8-bit codeset ISO 8859/1, which is itself one of a set of 8-bit codesets, all in regular use in mail. Several non-European languages use 7 and 8-bit multi-octet character sets.

In addition, the ISO is currently working on specification of a 32-bit universal character set, ISO 10646 [3], and a related proposal for an ASCII/Universal Character set (AUC) algorithm that would convert the 32-bit codes into 1-5 octet sequences.
The AUC code is deliberately designed to be usable with existing software; in particular, it is mailable through an 8-bit SMTP MTA.

Today, to support these newer character sets, most of the SMTP MTAs being distributed by vendors no longer restrict mail to 7 bits. This memo documents the existing usage and adds some refinements to improve safe interoperability with older 7-bit SMTP MTAs.

This memo also proposes a new header field to designate the character set used in headers during the transition, where not ASCII-7 or AUC. It uses the Encoding header and structure [5] to identify the character set(s) used in the content of the message when not AUC.

4. Terminology: Octets, Characters, and Character sets

Before proceeding, it is necessary to introduce some definitions. For the purposes of this document:

- An octet is an 8-bit datum, which may contain values 0 through 255 decimal.

- A character is a conceptual entity, such as "A" or "o-diaeresis" (o with 2 dots over it). The ISO working group has a simple definition of "coded character": a coded character is something that the standard assigns a code to.

- A (coded) character set is a transformation algorithm which maps characters (as defined in UCS, see below) to octets or sequences of octets.

Note that the same character coded in different character sets may result in different octets. For example, the character "o-diaeresis" is code point 246 decimal in UCS: in the Swedish national variant of the ECMA-35 character set it is the octet 124, in IS 8859/1 it is the octet 246, and in AUC it is the 2 octets 160 246.

IS 10646 defines (will define) a 32 bit set, UCS-32, with characters assigned to integer code points in the range 0. to 4294967295. The first 128 code points are ASCII-7. The first 256 will almost certainly be IS 8859/1 (ASCII-8). There is no reservation of octets corresponding to C0 or C1.
For example, LATIN CAPITAL LETTER A is 65., or 00 00 00 41 in hex, so its canonical form contains three octets with C0 values. The other problem is that the canonical form is 4 times the raw size of most alphabetic language files today. (Most now use a non-universal 7 or 8 bit code.)

The "C0 committee" defined (in a working draft) A Transformation Method to address these problems. Codes are mapped through an algorithm to a 1-5 octet sequence.

For the purposes of this description, C0-G1 are defined a little differently than usual, for simplicity:

   C0    00 to 20
   G0    21 to 7E
   C1    7F to 9F
   G1    A0 to FF

Ranges of the UCS code space are mapped to ranges of AUC as follows:

   UCS-32 (decimal codes)        AUC (hexadecimal octets)

   0. to 159.                    00 to 9F
   160. to 255.                  A0 A0 to A0 FF
   256. to 16405.                A1 21 to F5 FF
   16406. to 233005.             F6 21 21 to FB FF FF
   233006. to 4294967295.        FC 21 21 21 21 to FF 59 3C C8 C3

The octets used in the multi-octet characters are in G0 and in G1. The octets in C0 and C1 are mapped 1-1, and always represent themselves. There are no shifts or locking shifts, a major technical advantage over the previous draft of 10646. Any C0 or C1 character (including, e.g., SPACE) thus provides a resynchronization point if an error occurs.

5. What is a Line?

SMTP messages are composed of lines. A line consists of 0-998 text octets followed by 13 (CR) 10 (LF), which are not included in the count. As defined in RFC 821, this is a "minimum maximum": an MTA MUST accept and relay lines of this length, and MAY allow lines of any length.

Text octets are defined to be 7 (BEL), 8 (BS), 9 (TAB), 11 (VT), 12 (FF), 27 (ESC), 32 (SPACE), 33-127 (G0), and 160-255 (G1). These octets MAY be included in SMTP mail messages and MUST be relayed by SMTP MTAs.

The following octets are not text octets: 10 (LF), 13 (CR), 138, 141. These octets MUST NOT be included in text lines. 10 and 13 are used in the line termination sequence.
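
The octet classes and the line limit defined in this section can be sketched in Python as follows. This is an illustrative check only, not part of the specification: the names TEXT_OCTETS, FORBIDDEN_OCTETS, classify_octet and is_guaranteed_line are hypothetical, and the sets simply transcribe the decimal values listed above.

```python
# Octet values transcribed from the lists in this section (decimal).
TEXT_OCTETS = (
    frozenset({7, 8, 9, 11, 12, 27, 32})
    | frozenset(range(33, 128))    # 33-127, labelled G0 in this section
    | frozenset(range(160, 256))   # 160-255, G1
)
FORBIDDEN_OCTETS = frozenset({10, 13, 138, 141})
MAX_LINE_OCTETS = 998  # CR LF is not included in the count


def classify_octet(octet: int) -> str:
    """Return "text", "forbidden", or "other" for one octet value."""
    if octet in TEXT_OCTETS:
        return "text"
    if octet in FORBIDDEN_OCTETS:
        return "forbidden"
    return "other"


def is_guaranteed_line(line: bytes) -> bool:
    """True if the line is 0-998 text octets followed by CR LF.

    Assumption: "guaranteed" here means a line that every MTA MUST
    accept and relay; an MTA MAY accept lines that fail this check.
    """
    if not line.endswith(b"\r\n"):
        return False
    body = line[:-2]
    return len(body) <= MAX_LINE_OCTETS and all(o in TEXT_OCTETS for o in body)
```

A line that fails only on length may still be relayed by a particular MTA, since 998 is a minimum maximum; a line containing a bare 10 or 13 fails because those octets are reserved for the terminator.
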
138 and 141 are the octets 10 and 13 with the 8th bit set and will cause unexpected results with 7-bit SMTP MTAs.

Some implementations (usually implicitly, as a consequence of operating system or file system semantics) convert CR and/or LF appearing by themselves, i.e. "within" a line, to an end of line sequence. This behavior is (now) valid. In particular, SMTP MTAs SHOULD accept lines of message text and of commands which are terminated only with 10 (LF). (This recognizes the operational reality that a number of existing SMTP MTAs misinterpreted the end-of-line specification.) Mailers MUST send lines of message text and commands terminating with 13 (CR) 10 (LF).

All other values are discouraged as text octets. Many are known to cause difficulties with particular SMTP MTA implementations or with particular operating systems. Nevertheless, MTAs SHOULD pass these octets wherever possible.

6. Mail Message Format

6.1. Header Field Keywords

Header field keywords will remain in ASCII-7. This includes all keywords, such as "with" in a Received header, or "delivery-report" in a Message-Type header, as well as the header keywords themselves. There is no expectation that this restriction will ever need to be relaxed: user agents may recognize keywords and present them to meet any arbitrary user requirements.

6.2. Header Field Bodies

It is perhaps obvious that unrestricted use of character sets other than ASCII-7 and AUC in message headers will be a source of problems. However, use of other sets (notably shift-JIS Kanji and ECMA) is common in current practice. There are some headers in which it is obviously useful and should be permitted (e.g. Subject:), and others in which it may cause an MUA (Mail User Agent) that does not understand it to mis-parse the header.
In particular, note that characters outside the ASCII-7 set may be "stripped" by non-compliant MTAs to octets that correspond to the values of <, >, (, ), and, more painfully, the values of " and \.

Field bodies of "unstructured" header elements (such as Subject:) MAY contain the full range of text octets without any additional transformations.

Field bodies of "structured" header elements (such as To:) which apply "\" (92) escaping MUST be careful to apply the same escaping not just to "meta" text octets like "<" (60) but also to text octets in the 160-255 range which, when stripped to 7 bits, match the meta octets (e.g., 188, which strips to 60).

Refer to RFC 822 for the precise definition of which syntax elements require special characters to be quoted, and which prohibit it. It is quite unambiguous on this subject.

6.3. X-Header-CharSet: Header

In the beginning (1981 :-), mail messages were in a universal character set, ASCII-7. In the near future, they may again be in a universal character set (AUC or something like it). Today, messages are in a number of character sets, all sent in octets without any character set identification.

The X-Header-CharSet header field is to be used to identify the character set used in the header when not ASCII-7 or AUC. In the absence of an X-Header-CharSet header field, the default character set is defined to be ASCII-7 or AUC.

MTAs MUST NOT interpret the X-Header-CharSet field, or attempt to convert from one set to another; the MTA's responsibility is to pass the bits unmunged. A gateway may, of course, perform whatever transformation is required into the "foreign" environment. (It should also be noted that private point-to-point arrangements between consenting MTAs are outside the scope of this, or indeed any, standard.)
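
Returning to the escaping rule of Section 6.2 for structured field bodies, the idea can be sketched as follows. This is a minimal illustration, not a header parser: it assumes the octets passed in already sit in a context where RFC 822 permits "\" quoting, and META_OCTETS is an illustrative set built from the characters named in Section 6.2 rather than the full RFC 822 rules. The function names are hypothetical.

```python
# Illustrative meta octets named in Section 6.2: < > ( ) " and \
# Assumption: RFC 822 decides which of these actually need quoting in a
# given syntax element; this sketch treats them all alike.
META_OCTETS = frozenset(b'<>()"\\')
BACKSLASH = 0x5C  # 92 decimal


def escape_structured(body: bytes) -> bytes:
    """Put a backslash before every octet whose low 7 bits are a meta octet."""
    out = bytearray()
    for octet in body:
        if (octet & 0x7F) in META_OCTETS:
            out.append(BACKSLASH)
        out.append(octet)
    return bytes(out)


def strip_to_7_bits(data: bytes) -> bytes:
    """Model a non-compliant MTA that clears the 8th bit of each octet."""
    return bytes(octet & 0x7F for octet in data)
```

For example, octet 188 becomes the pair 92 188; if a stripping MTA then clears the 8th bit, the result is 92 60, a quoted "<" rather than a bare one.
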
The character set selected must be ASCII-7 "conformant", that is, it must assign substantially all of the ASCII-7 characters to the same code points, to permit keywords and other important elements to be represented. IS 2022 methods are conformant, with the shift (back) to ASCII-7 wherever needed. ECMA-35 sets are borderline since they replace ASCII-7 code points, but should usually not be a problem. (For example, the German variant replaces the "@" character; a very strict interpretation might try to make Internet addresses un-representable. This is, of course, silly.)

The content of the X-Header-CharSet field is a single token in ASCII-7. Appendix A is a partial list of values for the X-Header-CharSet field.

The field has an X- name because it is a temporary expedient: as soon as the International Standard AUC is defined, use of any other set in headers is (to be) deprecated.

7. Character Set Encodings

The content of the message may be encoded by a number of different character sets (as explained above, character sets are defined to be encodings of the IS 10646 UCS code point space).

   Encoding: Text

is defined to be text in AUC.

   Encoding: <character set identifier> Text

is an encoding (transform) of UCS to the code point assignments defined by the named character set. It is perhaps amusing to note that it is not necessary to have the UCS code points defined to have a complete definition of an encoding such as ISO_8859-1 Text.

For example, messages in the Japanese Kanji encodings in common use now should use:

   Encoding: ISO_2022 (JIS Kanji) Text

The (optional) comment is useful to explain to users who are not being presented the actual glyphs that the incomprehensible text following is really Kanji. The comment MUST NOT be interpreted by any program. (Particularly in the case of ISO_2022, where the code sets are selected by the standard escape sequences.)
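
Since "Encoding: Text" content is AUC, the octets on the wire follow the range table in Section 4. The sketch below is one reading of that table, not a definition taken from IS 10646: it assumes each trailing octet is a base-190 digit, with digit values 0-93 mapped to 21-7E and 94-189 mapped to A0-FF, and that the two-octet range begins at code point 256. Under those assumptions the sketch reproduces the octet end points of every row of the table. The function names are hypothetical.

```python
def _digit_octet(digit: int) -> int:
    """Map a base-190 digit to a G0 (0x21-0x7E) or G1 (0xA0-0xFF) octet."""
    return 0x21 + digit if digit < 94 else 0xA0 + (digit - 94)


# (first UCS code point, first lead octet, number of trailing octets),
# highest range first; values are read off the Section 4 table.
_MULTI_OCTET_RANGES = (
    (233006, 0xFC, 4),
    (16406, 0xF6, 2),
    (256, 0xA1, 1),
)


def auc_encode(code_point: int) -> bytes:
    """Encode one UCS-32 code point as a 1-5 octet AUC sequence (sketch)."""
    if not 0 <= code_point <= 0xFFFFFFFF:
        raise ValueError("code point outside the 32-bit UCS range")
    if code_point < 160:       # C0, G0, C1: octets represent themselves
        return bytes([code_point])
    if code_point < 256:       # G1 code points: prefixed with A0
        return bytes([0xA0, code_point])
    for first, lead, trailing in _MULTI_OCTET_RANGES:
        if code_point >= first:
            offset = code_point - first
            tail = []
            for _ in range(trailing):
                offset, digit = divmod(offset, 190)
                tail.append(_digit_octet(digit))
            # What remains of the offset selects the lead octet.
            return bytes([lead + offset] + tail[::-1])
    raise AssertionError("unreachable: every code point >= 256 has a range")
```

With this reading, auc_encode(246) yields the octets 160 246 given for o-diaeresis in Section 4, and auc_encode(65) yields the single octet 65.
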

This scheme extends, of course, to multiple parts and other nested encodings as described by RFC 1154 [5].

A. List of Character set Identifiers

The following is a list of character set identifiers that may be used as the value of the header field X-Header-CharSet: and in the Encoding: header. All of these keywords are added to the defined set for the Encoding: header.

   ISO_8859-1    IS 8859/1, commonly called Latin-1
   ISO_8859-2    IS 8859/2, Latin-2
   ISO_8859-3    IS 8859/3, Latin-3
   ISO_8859-4    IS 8859/4, Latin-4
   ISO_8859-5    IS 8859/5, Cyrillic
   ISO_8859-6    IS 8859/6, Arabic
   ISO_8859-7    IS 8859/7, Greek
   ISO_8859-8    IS 8859/8, Hebrew
   ISO_8859-9    IS 8859/9, Latin-5
   ECMA_35-DK    ECMA replaced-code point set for (e.g.) Denmark. Others in the same pattern: ECMA_35-NO, etc.
   ISO_2022      ISO 2022
   IBM_CP437     I.B.M. code page 437, as an example; other code pages follow the same pattern. It is probably not a good idea to use the EBCDIC based pages (e.g. 274, 500, etc.).
   APPLE_MAC     Apple Computer's Macintosh character set
   Others...

This list makes no claim to be complete at this draft.

References

[1] David H. Crocker. Standard for the Format of ARPA Internet Text Messages. RFC 822, University of Delaware, August, 1982.

[2] European Computer Manufacturers Association standard 35. [citation tba with proper title and author].

[3] International Organization for Standardization. Information technology -- Universal Coded Character Set (UCS). ISO/IEC DIS 10646, ISO, November, 1990. (Draft, defeated in June 1991 ballot).

[4] Jon Postel. Simple Mail Transfer Protocol. RFC 821, USC Information Sciences Institute, August, 1982.

[5] David Robinson, Robert L. Ullmann. Encoding Header Field for Internet Messages. RFC 1154, Prime Computer, April, 1990.

[6] Robert L. Ullmann. SMTP on X.25. RFC 1090, Prime Computer, February, 1989.

Authors' Addresses

David Robinson 10-30
Prime Computer, Inc.
500 Old Connecticut Path
Framingham, MA 01701
USA

Phone: +1 508 620 2800 x1774
Email: DRB@Relay.Prime.COM

Robert Ullmann 10-30
Prime Computer, Inc.
500 Old Connecticut Path
Framingham, MA 01701
USA

Phone: +1 508 620 2800 x1736
Email: Ariel@Relay.Prime.COM