Network Working Group                            D. Robinson, R. Ullmann
Internet Draft                                      Prime Computer, Inc.
                                                            October 1991


                International character support in SMTP


1. Status of this Memo

   This memo describes an update to the SMTP protocol, and to the format
   of an Internet mail message, to provide support for the character
   sets used in the world's many languages. This memo documents existing
   usage as well as specifying some additional interoperability
   refinements.  It updates RFCs 821, 822, and 1090 [4, 1, 6].

   The Internet is no longer a creature of the United States, much less
   of DARPA (the US Defense Advanced Research Projects Agency). It is now
   an international network, and the ability to communicate in any of
   the world languages on an equal footing is an imperative.

   This draft attempts to track the development of ISO 10646 [3], a
   moving target at this writing. The reference citation below is to the
   previous 10646 draft, which failed in the balloting in June of 1991.
   It is therefore expected that this memo will potentially change until
   the publication of IS 10646. Some of the following text refers to
   10646 in the present tense, as if it were an IS now; it should be
   understood in this context.

   Distribution of this memo is unlimited.


2. Introduction

   SMTP has been defined over the TCP as a 7 bit text protocol.  While
   there is some dispute as to whether this is actually a restriction,
   the question is now mooted:  Internet mailers are required to pass
   the 8th bit when relaying Internet mail on the TCP. It is understood
   that some time will pass before all Internet Mail Transfer Agents
   (MTAs) can be expected to comply with the new requirement. This
   explicitly modifies the provisions of RFC 821 [4].

   In addition, a conceptual basis and a specific header field are
   described for designating the character set(s) used, when that
   character set is not ASCII-7 or AUC (see below); this provides
   documentation of character sets being used presently in the Internet,
   in the absence of the ISO universal set standard.


Robinson, Ullmann                                               [Page 1]

Internet Draft  International character support in SMTP     October 1991


   Note:  This specification provides 8 bit text. It does NOT provide
   "transparent" binary; in particular, the mail message is still
   represented at the presentation layer (in the ISO model) as an
   ordered set of text lines, of limited length.

   This specification is written specifically for Internet mail transfer
   agents, i.e. those operating as part of the Internet.  It may not be
   directly applicable to the mail transfer agents of other networks; in
   the case of gateways to the Internet, it applies only to the
   interface presented to the Internet, not necessarily to the other
   network.

   Likewise, a host receiving an SMTP mail message for final delivery
   is subject to this specification only in that it should interpret the
   incoming message as being in AUC, except where explicitly declared
   otherwise as provided below.


3. Motivation

   RFC 821 and 822 defined a mail message as a sequence of lines of
   7-bit ASCII characters.  At the time (1981), ASCII-7 constituted a
   reasonable "lingua franca" for mail messages.

   Much has changed.  Today, SMTP mail is in use by a number of language
   groups in character sets other than ASCII-7.

   A number of European-language systems send mail in 7-bit ECMA-35
   character sets [2] in which specific ASCII-7 characters have been
   replaced with local non-ASCII-7 characters.  ASCII itself has been
   redefined as an 8 bit set (ASCII-8), with code points identical to
   the 8-bit codeset ISO8859/1, which is itself one of a set of 8-bit
   codesets, all in regular use in mail. Several non-European languages
   use 7 and 8-bit multi-octet character sets.

   In addition, the ISO is currently working on specification of a
   32-bit universal character set, ISO10646 [3], and a related proposal
   for an ASCII/Universal Character set (AUC) algorithm that would
   convert the 32-bit codes into 1-5 octet sequences.  The AUC code is
   deliberately designed to be usable with existing software; in
   particular, it is mailable through an 8-bit SMTP MTA.

   Today, to support these newer character sets, most of the SMTP MTAs
   being distributed by vendors no longer restrict mail to 7 bits.

   This memo documents the existing usage and adds some refinements to
   improve safe interoperability with older 7-bit SMTP MTAs.

   This memo also proposes a new header field to designate the character
   set used in headers during the transition, where not ASCII-7 or AUC.
   It uses the Encoding header and structure [5] to identify the
   character set(s) used in the content of the message when not AUC.


4. Terminology: Octets, Characters, and Character sets

   Before proceeding, it is necessary to introduce some definitions.
   For the purposes of this document:

      - An octet is an 8-bit datum, which may contain values 0 through
        255 decimal.

      - A character is a conceptual entity, such as "A" or "o-diaeresis"
        (o with 2 dots over it). The ISO working group has a simple
        definition of "coded character": a coded character is something
        that the standard assigns a code to.

      - A (coded) character set is a transformation algorithm which maps
        characters (as defined in UCS, see below) to octets or sequences
        of octets.

   Note that the same character coded in different character sets may
   result in different octets.  For example, the character
   "o-diaeresis" is code point 246 decimal in UCS.  In the Swedish
   national variant of the ECMA-35 character set it is the octet 124,
   in IS 8859/1 it is the octet 246, and in AUC it is the 2 octets
   160 246.

   IS 10646 defines (will define) a 32 bit set, UCS-32, with characters
   assigned to integer code points in the range 0. to 4294967295.

   The first 128 code points are ASCII-7. The first 256 will almost
   certainly be IS 8859/1 (ASCII-8).  The canonical 4-octet form poses
   two problems for existing software.  First, there is no reservation
   of octets corresponding to C0 or C1: for example, LATIN CAPITAL
   LETTER A is 65., or 00 00 00 41 in hex, which contains octets in
   the C0 range.  Second, the canonical form is 4 times the raw size
   of most alphabetic language files today (most now use a
   non-universal 7 or 8 bit code).

   The "C0 committee" defined (working draft) A Transformation Method,
   to address these problems. Codes are mapped through an algorithm to a
   1-5 octet sequence.

   For the purposes of this description, C0-G1 are defined a little
   differently than usual, for simplicity:

       C0       00 to 20
       G0       21 to 7E
       C1       7F to 9F
       G1       A0 to FF

   Ranges of the UCS code space are mapped to ranges of AUC as follows:

       UCS-32 (decimal codes)       AUC (hexadecimal octets)

       0.      to 159.              00             to 9F
       160.    to 255.              A0 A0          to A0 FF
       256.    to 16405.            A1 21          to F5 FF
       16406.  to 233005.           F6 21 21       to FB FF FF
       233006. to 4294967295.       FC 21 21 21 21 to FF 59 3C C8 C3

   The octets used in the multi-octet characters are in G0 and in G1.
   The octets in C0 and C1 are mapped 1-1, and always represent
   themselves.

   There are no shifts or locking shifts, a major technical advantage
   over the previous draft of 10646.  Any C0 or C1 character (including,
   e.g., SPACE) thus provides a resynchronization point if an error
   occurs.
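
   The range table above can be turned into a small encoder and
   decoder.  The following Python sketch is a reconstruction from the
   table only, not from the ISO working draft: it assumes that each
   trailing octet is a base-190 digit drawn from G0 (21 to 7E) followed
   by G1 (A0 to FF), that the lead octet carries the most significant
   digit, and that the ranges are contiguous (the two-octet range
   starting at 256.).  The names auc_encode and auc_decode are
   illustrative.

```python
# Sketch of the AUC mapping, reconstructed from the range table in this memo.
# Assumption: trailing octets are base-190 digits over G0 then G1.
TRAIL = bytes(range(0x21, 0x7F)) + bytes(range(0xA0, 0x100))  # 190 octets

# (first code, last code, first lead octet, number of trailing octets)
RANGES = [
    (256, 16405, 0xA1, 1),
    (16406, 233005, 0xF6, 2),
    (233006, 0xFFFFFFFF, 0xFC, 4),
]


def auc_encode(code: int) -> bytes:
    """Map one UCS-32 code point to its 1-5 octet AUC sequence."""
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("code point outside the UCS-32 range")
    if code <= 159:                      # C0, G0 and C1 represent themselves
        return bytes([code])
    if code <= 255:                      # A0 A0 .. A0 FF
        return bytes([0xA0, code])
    for first, last, lead, ntrail in RANGES:
        if code <= last:
            offset = code - first
            digits = []
            for _ in range(ntrail):      # least significant digit first
                offset, digit = divmod(offset, len(TRAIL))
                digits.append(TRAIL[digit])
            return bytes([lead + offset] + digits[::-1])
    raise AssertionError("unreachable")


def auc_decode(data: bytes) -> list[int]:
    """Inverse of auc_encode, applied to a whole octet string."""
    codes, i = [], 0
    while i < len(data):
        lead = data[i]
        i += 1
        if lead <= 0x9F:                 # single octet, maps 1-1
            codes.append(lead)
            continue
        if lead == 0xA0:                 # second octet is the code itself
            if i >= len(data) or data[i] < 0xA0:
                raise ValueError("bad octet after A0")
            codes.append(data[i])
            i += 1
            continue
        first, last, base, ntrail = next(
            r for r in reversed(RANGES) if lead >= r[2])
        trail = data[i:i + ntrail]
        if len(trail) != ntrail:
            raise ValueError("truncated AUC sequence")
        value = lead - base
        for octet in trail:              # index() rejects C0/C1 octets
            value = value * len(TRAIL) + TRAIL.index(octet)
        if first + value > last:
            raise ValueError("sequence beyond the UCS-32 range")
        codes.append(first + value)
        i += ntrail
    return codes
```

   Under these assumptions auc_encode(246) yields the octets 160 246,
   and the end points of every row of the table are reproduced,
   including FF 59 3C C8 C3 for 4294967295.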


5. What is a Line?

   SMTP messages are composed of lines.  A line consists of 0-998 text
   octets ending with a 13 (CR) 10 (LF) not included in the count. As
   defined in RFC 821, this is a "minimum maximum": an MTA MUST accept
   and relay lines of this length, and MAY allow lines of any length.

   Text octets are defined to be 7 (BEL), 8 (BS), 9 (TAB), 11 (VT), 12
   (FF), 27 (ESC), 32 (SPACE), 33-127 (G0), and 160-255 (G1).  These
   octets MAY be included in SMTP mail messages and MUST be relayed by
   SMTP MTAs.

   The following octets are not text octets: 10 (LF), 13 (CR), 138, 141.
   These octets MUST NOT be included in text lines.  10 and 13 are used
   in the line termination sequence.  138 and 141 are the octets 10 and
   13 with the 8th bit set and will cause unexpected results with 7-bit
   SMTP MTAs.

   Some implementations (usually implicitly, as a consequence of
   operating system or file system semantics) convert CR and/or LF
   appearing by themselves, i.e. "within" a line, to an end of line
   sequence. This behavior is (now) valid.

   In particular, SMTP MTAs SHOULD accept lines of message text and of
   commands which are terminated only with 10 (LF).  (This recognizes
   the operational reality that a number of existing SMTP MTAs
   misinterpreted the end-of-line specification.)  Mailers MUST send
   lines of message text and commands terminating with 13 (CR) 10 (LF).
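
   The asymmetry between what is accepted and what is sent can be
   sketched as follows.  This is a minimal Python illustration, not a
   complete SMTP reader; the function names are hypothetical, and a
   bare CR inside a line is simply left in place.

```python
def split_received(data: bytes) -> list[bytes]:
    """Split received octets into lines, accepting CR LF or a bare LF."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()                      # the data ended with a terminator
    # Drop the CR of a CR LF pair; a line ended by LF alone has none.
    return [ln[:-1] if ln.endswith(b"\r") else ln for ln in lines]


def serialize(lines: list[bytes]) -> bytes:
    """Emit every line terminated with 13 (CR) 10 (LF)."""
    return b"".join(ln + b"\r\n" for ln in lines)
```

   For example, serialize(split_received(b"a\nb\r\n")) produces
   b"a\r\nb\r\n": input terminated only with LF is tolerated, and the
   output always uses CR LF.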

   All other values are discouraged as text octets.  Many are known to
   cause difficulties with particular SMTP MTA implementations or with
   particular operating systems. Nevertheless, MTAs SHOULD pass these
   octets wherever possible.
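
   The three classes of octet described in this section can be
   expressed as a simple check.  The Python sketch below follows the
   octet values exactly as listed above (including 127 among the text
   octets); the names classify_octet and check_line, and the wording
   of the returned messages, are illustrative only.

```python
# Octet classes as listed in this section.
TEXT_OCTETS = frozenset(
    [7, 8, 9, 11, 12, 27, 32, *range(33, 128), *range(160, 256)])
NON_TEXT_OCTETS = frozenset([10, 13, 138, 141])
MAX_TEXT_OCTETS = 998                    # CR LF is not included in the count


def classify_octet(octet: int) -> str:
    """Return 'text', 'forbidden' or 'discouraged' for one octet value."""
    if octet in NON_TEXT_OCTETS:
        return "forbidden"
    if octet in TEXT_OCTETS:
        return "text"
    return "discouraged"


def check_line(line: bytes) -> list[str]:
    """List the problems in one line, given without its CR LF terminator."""
    problems = []
    if len(line) > MAX_TEXT_OCTETS:
        problems.append(
            f"{len(line)} octets exceeds the guaranteed {MAX_TEXT_OCTETS}")
    for position, octet in enumerate(line):
        kind = classify_octet(octet)
        if kind != "text":
            problems.append(f"octet {octet} at offset {position} is {kind}")
    return problems
```

   A line longer than 998 octets is reported rather than rejected,
   since an MTA MAY allow lines of any length.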


6. Mail Message Format

6.1. Header Field Keywords

   Header field keywords will remain in ASCII-7. This includes all
   keywords, such as "with" in a Received header, or "delivery-report"
   in a Message-Type header, as well as the header keywords themselves.
   There is no expectation that this restriction will ever need to be
   relaxed: user agents may recognize keywords and present them to meet
   any arbitrary user requirements.

6.2. Header Field Bodies

   It is perhaps obvious that unrestricted use of character sets other
   than ASCII-7 and AUC in message headers will be a source of problems.
   However, use of other sets (notably shift-JIS Kanji and ECMA) is
   common current usage.

   There are some headers in which it is obviously useful, and should be
   permitted (e.g. Subject:) and others in which it may cause an MUA
   (Mail User Agent) that does not understand it to mis-parse the
   header. In particular, note that characters outside the ASCII-7 set
   may be "stripped" by non-compliant MTAs to octets that correspond to
   the values of <, >, (, ), and, more painfully, the values of " and \.

   Field bodies of "unstructured" header elements (such as Subject:) MAY
   contain the full range of text octets without any additional
   transformations.

   Field bodies of "structured" header elements (such as To:) which
   apply "\" (92) escaping MUST be careful to apply the same escaping
   not just to "meta" text octets like "<" (60) but also to text octets
   in the 160-255 range which, when stripped to 7 bits, match the meta
   octets, (e.g., 188).

   Refer to RFC 822 for the precise definition of which syntax elements
   require special characters to be quoted, and which prohibit it. It is
   quite unambiguous on this subject.
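
   The extra quoting rule for structured fields can be sketched as
   below.  This Python fragment assumes the "meta" octets are exactly
   the six named in this section (<, >, (, ), " and \) and that the
   text being processed belongs to a syntax element in which RFC 822
   permits "\" quoting; both the set and the function names are
   illustrative.

```python
# The octets named in this section: < > ( ) " and \
META = frozenset(b'<>()"\\')


def needs_quoting(octet: int) -> bool:
    """True for a meta octet, or a 160-255 octet that strips to one."""
    if octet in META:
        return True
    return 160 <= octet <= 255 and (octet & 0x7F) in META


def quote_octets(text: bytes) -> bytes:
    """Prefix each octet that needs it with "\\" (92)."""
    out = bytearray()
    for octet in text:
        if needs_quoting(octet):
            out.append(0x5C)
        out.append(octet)
    return bytes(out)
```

   With this rule the octet 188 is quoted, because 188 with the 8th bit
   removed is 60 ("<"), while 246 is passed unchanged.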

6.3. X-Header-CharSet: Header

   In the beginning (1981 :-), mail messages were in a universal
   character set, ASCII-7.  In the near future, they may again be in a
   universal character set (AUC or something like it).  Today, messages
   are in a number of character sets, all sent in octets without any
   character set identification.

   The X-Header-CharSet header field is to be used to identify the
   character set used in the header when not ASCII-7 or AUC.  In the
   absence of an X-Header-CharSet header field, the default character
   set is defined to be ASCII-7 or AUC.

   MTAs MUST NOT interpret the X-Header-CharSet field, or attempt to
   convert from one set to another; the MTA's responsibility is to pass
   the bits unmunged. A gateway may, of course, perform whatever
   transformation is required into the "foreign" environment.  (It
   should also be noted that private point-to-point arrangements between
   consenting MTAs are outside the scope of this, or indeed any,
   standard.)

   The character set selected must be ASCII-7 "conformant", that is, it
   must assign substantially all of the ASCII-7 characters to the same
   code points, to permit keywords and other important elements to be
   represented.

   IS 2022 methods are conformant, with the shift (back) to ASCII-7
   wherever needed. ECMA-35 sets are borderline since they replace
   ASCII-7 code points, but should usually not be a problem. (For
   example, the German variant replaces the "@" character; a very strict
   interpretation might hold that this makes Internet addresses
   un-representable. This is, of course, silly.)

   The content of the X-Header-CharSet field is a single token in
   ASCII-7.  Appendix A is a partial list of values for the
   X-Header-CharSet field.

   The field has an X- name because it is a temporary expedient: as soon
   as the International Standard AUC is defined, use of any other set in
   headers is (to be) deprecated.
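
   A user agent (not an MTA, which MUST NOT interpret the field) might
   determine the header character set along the following lines.  This
   Python sketch assumes the header has already been split into lines
   of octets; the function name and the DEFAULT_CHARSET label are
   hypothetical, and the label is not a registered identifier.

```python
DEFAULT_CHARSET = "ASCII-7 or AUC"       # label for the default case only


def header_charset(header_lines: list[bytes]) -> str:
    """Return the X-Header-CharSet token, or the default when absent."""
    for line in header_lines:
        if line[:1] in (b" ", b"\t"):
            continue                     # continuation of a previous field
        name, sep, body = line.partition(b":")
        if sep and name.strip().lower() == b"x-header-charset":
            token = body.strip()
            if not token or len(token.split()) != 1:
                raise ValueError("field body must be a single token")
            return token.decode("ascii") # the token is ASCII-7
    return DEFAULT_CHARSET
```

   For example, a header containing "X-Header-CharSet: ISO_8859-1"
   yields "ISO_8859-1", and a header without the field yields the
   default.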


7. Character Set Encodings

   The content of the message may be encoded by a number of different
   character sets (as explained above, character sets are defined to be
   encodings of the IS 10646 UCS code point space).

       Encoding: Text

   is defined to be text in AUC.

       Encoding: <charset> Text

   is an encoding (transform) of UCS to the code point assignments
   defined by <charset>.

   It is perhaps amusing to note that it is not necessary to have the
   UCS code points defined to have a complete definition of an encoding
   such as ISO_8859-1 Text.

   For example, messages in Japanese Kanji, in common use now, should
   use:

       Encoding: ISO_2022 (JIS Kanji) Text

   The (optional) comment is useful, to explain to users not being
   presented the actual glyphs, that the incomprehensible text following
   is really Kanji. The comment MUST NOT be interpreted by any program.
   (Particularly in the case of ISO_2022, where the code sets are
   selected by the standard escape sequences).

   This extends, of course, to multiple parts and other nested encodings
   as described by RFC 1154.
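
   The two forms of the Encoding field shown in this section can be
   read as sketched below.  This Python fragment handles only those
   forms: it does not attempt the line counts, multiple parts or
   nested encodings of RFC 1154, it assumes the "Text" keyword is
   matched without regard to case, and it ignores "\" quoting inside
   comments.  The function names are illustrative.

```python
def strip_comments(value: str) -> str:
    """Remove parenthesized comments, allowing them to nest."""
    out, depth = [], 0
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def text_charset(encoding_value: str) -> str:
    """Return the character set named by a Text encoding field body."""
    words = strip_comments(encoding_value).split()
    if not words or words[-1].lower() != "text":
        raise ValueError("not a Text encoding")
    if len(words) == 1:
        return "AUC"                     # plain "Text" is defined to be AUC
    if len(words) == 2:
        return words[0]                  # "<charset> Text"
    raise ValueError("form not covered by this sketch")
```

   The comment is discarded before anything else is examined, so
   "ISO_2022 (JIS Kanji) Text" and "ISO_2022 Text" give the same
   result.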


A. List of Character set Identifiers

   The following is a list of character set identifiers that may be used
   as the value of the header field X-Header-CharSet:  and in the
   Encoding: header. All of these keywords are added to the defined set
   for the Encoding: header.

   ISO_8859-1      IS 8859/1, commonly called Latin-1.

   ISO_8859-2      IS 8859/2, Latin-2

   ISO_8859-3      IS 8859/3, Latin-3

   ISO_8859-4      IS 8859/4, Latin-4

   ISO_8859-5      IS 8859/5, Cyrillic

   ISO_8859-6      IS 8859/6, Arabic

   ISO_8859-7      IS 8859/7, Greek

   ISO_8859-8      IS 8859/8, Hebrew

   ISO_8859-9      IS 8859/9, Latin-5

   ECMA_35-DK      ECMA replaced-code point set for (e.g.) Denmark.
                   Others in the same pattern: ECMA_35-NO, etc.

   ISO_2022        ISO 2022

   IBM_CP437       I.B.M. code page 437, for example; others follow the
                   same pattern.  It probably is not a good idea to use
                   the EBCDIC based pages (e.g. 274, 500, etc.)

   APPLE_MAC       Apple Computer's Macintosh character set

   Others...       This list makes no claim to be complete at this
                   draft.
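
   For a receiving user agent, some of these identifiers correspond
   directly to codecs in a modern language library.  The Python sketch
   below is an illustration only: the table is an assumption of this
   example, not part of the specification.  It takes APPLE_MAC to mean
   the "Mac Roman" set, and it omits ISO_2022 and the ECMA_35 variants,
   which have no single equivalent in the Python standard library.

```python
# Hypothetical mapping from identifiers in this list to Python codec names.
PYTHON_CODECS = {f"ISO_8859-{n}": f"iso8859_{n}" for n in range(1, 10)}
PYTHON_CODECS["IBM_CP437"] = "cp437"
PYTHON_CODECS["APPLE_MAC"] = "mac_roman"  # assumption: Mac Roman is meant


def decode_with(identifier: str, octets: bytes) -> str:
    """Decode octets declared to be in the given character set."""
    codec = PYTHON_CODECS.get(identifier)
    if codec is None:
        raise LookupError(f"no codec known for {identifier!r}")
    return octets.decode(codec)
```

   The same character arrives as different octets depending on the
   declared set: o-diaeresis is the octet 246 under ISO_8859-1 and the
   octet 148 under IBM_CP437.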


References

[1]   David H. Crocker.
      Standard for the Format of ARPA Internet Text Messages.
      RFC 822, University of Delaware, August, 1982.

[2]
      European Computer Manufacturers Association standard 35.
      [citation tba with proper title and author].

[3]   International Organization for Standardization.
      Information technology -- Universal Coded Character Set (UCS).
      ISO/IEC DIS 10646, ISO, November, 1990.
      (Draft, defeated in June 1991 ballot).

[4]   Jon Postel.
      Simple Mail Transfer Protocol.
      RFC 821, USC Information Sciences Institute, August, 1982.

[5]   David Robinson, Robert L. Ullmann.
      Encoding Header Field for Internet Messages.
      RFC 1154, Prime Computer, April, 1990.

[6]   Robert L. Ullmann.
      SMTP on X.25.
      RFC 1090, Prime Computer, February, 1989.


Authors' Addresses

      David Robinson 10-30
      Prime Computer, Inc.
      500 Old Connecticut Path
      Framingham, MA 01701
      USA

      Phone: +1 508 620 2800 x1774
      Email: DRB@Relay.Prime.COM

      Robert Ullmann 10-30
      Prime Computer, Inc.
      500 Old Connecticut Path
      Framingham, MA 01701
      USA

      Phone: +1 508 620 2800 x1736
      Email: Ariel@Relay.Prime.COM