draft          X.400 use of extended character sets          Apr 92

X.400 use of extended character sets

Thu Jun 18 14:04:25 MET DST 1992

Harald Tveit Alvestrand
SINTEF DELAB
Harald.Alvestrand@delab.sintef.no

Status of this Memo

This draft document is being circulated for comment. If consensus is
reached it may be submitted to the RFC editor as a Proposed Standard
protocol specification, for use in X.400 in the Internet. Please send
comments to the author, or to the RARE WG-MSG list.

The following text is required by the Internet-draft rules:

This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. Note that other groups may also distribute
working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use Internet
Drafts as reference material or to cite them other than as a "working
draft" or "work in progress."

Please check the I-D abstract listing contained in each Internet
Draft directory to learn the current status of this or any other
Internet Draft.

Alvestrand              Expires Dec 18 1992              [Page 1]

1. Introduction

Since 1988, X.400 has had the capacity for carrying a large number of
different character sets in a message by using the body part
"GeneralText" defined by ISO/IEC 10021-7.

Since 1992, the Internet has also had the means of passing around
messages containing multiple character sets, by using the mechanism
defined in RFC-MIME.

This RFC defines a suggested method of using "GeneralText" in order
to harmonize as much as possible the usage of this body part.

A companion RFC, RFC-X4MAP, describes the exact way to put MIME
structured messages into X.400/88 messages.

2. General principles

2.1. Goals

The aim of this memo is to define a way of using existing standards
to achieve:

   (1) in the short term, a standard for sending E-mail in the
       European languages (Latin letters with European accents,
       Greek and Cyrillic)

   (2) in the medium term, extending this to cover the Hebrew and
       Arabic character sets

   (3) in the long term, opening up true international E-mail by
       allowing the full character set specified in ISO-10646
       (currently a DIS) to be used.

The author believes that this document gives a specification that can
easily accommodate the use of any character set in the ISO registry,
and, by giving guidance rules for choosing character sets, will help
interworking.

2.2. Families of character sets

2.2.1. ISO 6937/T.61

ISO 6937 is a code technique used and recommended in T.51 and T.101
(Teletex and Videotex service) and in X.500, providing a repertoire
of 333 characters from the Latin script by use of non-spacing
diacritical marks. It corresponds closely to CCITT recommendation
T.61.

The problem with that technique is that the character stream comes in
two modes, i.e., some characters are coded with one byte and some
with two (composite characters). This makes information processing
systems such as an E-mail UA or GW more complex.

It is also not extensible to other languages like Korean or Chinese,
or even Greek, without invoking the character set switching
techniques of ISO 2022.

<< The 1991 revision of ISO 6937 removes the possibility of Greek >>

2.2.2. ISO 8859

ISO 8859 defines a set of character sets, each suitable for use in
some group of languages. Each character in ISO 8859 is coded in a
single byte.

There are currently 9 parts of ISO 8859, plus a "supplementary" set,
registered as ISO IR 154.

All languages using single-byte characters can be written in one or
another of the ISO 8859 sets. There are sets covering Greek, Hebrew
and Arabic.

All the ISO 8859 sets include US ASCII as a subset. All use 8 bits.

ISO 8859 is regarded by many as a solution; for instance, the X
system now comes with ISO-8859-1 as the "standard" character set,
with the possibility of specifying others. But since the same
applications often do not support character set switching within
text, it is problematic to use these sets in a truly multilingual
environment.

It turns out to work fine, however, if the second language is
English, since this can be written in all ISO 8859 sets.

Parts 3 and 4 have not seen wide acceptance, and it is expected that
they will be discarded. They should therefore not be used.

2.2.3. ISO 10646

At the time of writing, ISO 10646 is a draft International Standard.
It is basically a 32-bit character set, with all of the currently
used characters being numbered by the first 16 bits, leaving some
room for expansion.

It is not possible to use ISO 10646 as a normal character set,
because it does not conform to the rules for usage of byte values set
down in ISO 2022 and other places; it uses the "control space" for
(parts of) graphic character codes.

There are proposals for coding schemes that allow an ISO 10646 data
stream to be "compressed" by using only 1 byte for ASCII characters
and more bytes for less widely used characters. This is, however, one
of the most hotly debated parts of the standard, so it is currently
unsafe to base any implementation on it.

This character set will become very important in the future, but
cannot be used before the standard becomes stable.

Since it is not possible to carry ISO 10646 in GeneralText, new body
parts must be defined to carry it in X.400 messages.

2.3. Body parts that can be used in X.400

At the moment, no established way of transferring a full set of
characters in X.400-based E-mail exists.
In the future, it is likely that a new body part, based on ISO 10646,
will be available; it is, however, dangerous to try to specify this
body part before ISO 10646 is final.

In the short term, the deployed and available body parts are:

   (1) IA5Text

   (2) For X.400/84: ISO6937Text and Teletex

   (3) For X.400/88: GeneralText

IA5Text is the method of choice for E-mail that contains only
characters from IA5 (equivalent to ASCII).

The ISO6937Text body part is defined in the ISO DIS documents
corresponding to X.400(84) [MOTIS-86]; these never became a standard,
so they are now quite difficult to find. It is in principle limited
to text that can be represented in ISO 6937. Since ISO 6937 refers to
the ISO 2022 method of changing character sets, it is theoretically
possible to use any ISO registered character set, but there is no
facility for announcing the character sets used. This makes
interworking with equipment that does not support the same character
sets complex. It is still, however, the only body part suitable for
carrying non-paginated text in non-basic character sets in X.400(84).

Teletex, which is identical in all versions of the X.400 standard,
has the same problem of implicit ISO6937, but has the added problem
that it also specifies a page format, with, for instance, a left
margin of 5 character positions. This is often not desirable. The
details of Teletex are specified in recommendation T.51 and its
relatives.

GeneralText is defined in ISO 10021-8, the part of [MOTIS] that
corresponds to CCITT recommendation [X.420]. It is an Extended body
part, so no modification to CCITT implementations is needed to carry
it. GeneralText is suitable for interchange, since it has proper
announcement facilities. It can use any number of character sets, and
announces them both in the Encoded Information Types of the X.400
envelope and in the parameters of the body part.
We recommend this body part for carrying unformatted text in
X.400/88.

3. GUIDELINES FOR THE GENERATION OF GENERALTEXT

3.1. Tutorial on GeneralText

A GeneralText message is a byte stream that contains characters and
character switching sequences according to [ISO 2022].

There are 4 graphic character sets active at any time in a
GeneralText message, called G0, G1, G2 and G3. In addition, there are
2 control character sets, called C0 and C1.

At any moment, one of the sets G0-G3 is active in code positions 2/0
to 7/14, and another is active in code positions 10/0 to 15/15. The
setting is achieved by so-called "locking shift" sequences. Single
characters from the non-active sets may be invoked by the use of
"single shift" sequences.

The control character sets always occupy the code positions 0/0 to
1/15 (C0) and 8/0 to 9/15 (C1).

The character sets currently active as G0-G3 and C0-C1 may be changed
using "character set designating sequences".

At the beginning of a GeneralText message, one must always assume
that set 6 (US-ASCII) is active as G0, shifted into the lower half,
that set is active as C0, and that no G1-G3 or C1 set is invoked.
This is specified in the definition of "GeneralString" in [X.208],
the definition of ASN.1.

If this is not a suitable initial state, a message must always start
with the necessary announcers and escape sequences to designate and
invoke the character sets that are actually used.

The character sets in use may be changed later in the message by use
of escape sequences.

The parameters of a GeneralText message always list all the character
sets used, by quoting their ISO reference numbers. It is impossible
to use a character set not registered with ISO in a GeneralText
message.
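
The state described in this tutorial can be illustrated with a small
sketch. It is not part of the specification: the function name
annotate and the tables are hypothetical, the escape values for the
locking shifts are the usual ISO 2022 ones, and single shifts and
multi-byte designations are deliberately left out. It only tracks
which set a graphic byte belongs to, starting from the initial state
given above.

```python
ESC = 0x1B

# Names for a few designations listed in this memo; others are shown raw.
KNOWN_SETS = {
    (0x28, 0x42): "ISO-IR-6",    # US-ASCII as G0
    (0x2D, 0x41): "ISO-IR-100",  # ISO 8859-1 as G1
    (0x2D, 0x4C): "ISO-IR-144",  # ISO 8859-5 as G1
    (0x2D, 0x46): "ISO-IR-126",  # ISO 8859-7 as G1
}

# Intermediate byte of a single-byte designating sequence -> G-set index.
DESIGNATORS = {0x28: 0, 0x29: 1, 0x2A: 2, 0x2B: 3, 0x2D: 1, 0x2E: 2, 0x2F: 3}

# Second byte of an escape-coded locking shift -> (code table half, G-set).
LOCKING_SHIFTS = {
    0x6E: ("GL", 2), 0x6F: ("GL", 3),                   # LS2, LS3
    0x7E: ("GR", 1), 0x7D: ("GR", 2), 0x7C: ("GR", 3),  # LS1R, LS2R, LS3R
}


def annotate(data):
    """Return (character set, 7-bit code) for every graphic byte in data."""
    designated = {0: "ISO-IR-6", 1: None, 2: None, 3: None}
    invoked = {"GL": 0, "GR": None}  # initial state: only G0, in the lower half
    result = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            if i + 2 < len(data) and data[i + 1] in DESIGNATORS:
                key = (data[i + 1], data[i + 2])
                name = KNOWN_SETS.get(key, "ESC %02X %02X" % key)
                designated[DESIGNATORS[data[i + 1]]] = name
                i += 3
                continue
            if i + 1 < len(data) and data[i + 1] in LOCKING_SHIFTS:
                half, g = LOCKING_SHIFTS[data[i + 1]]
                invoked[half] = g
                i += 2
                continue
            # Multi-byte designations, single shifts etc. are not handled.
            raise ValueError("unsupported escape sequence at offset %d" % i)
        if byte == 0x0F:      # SI: G0 into the lower half
            invoked["GL"] = 0
        elif byte == 0x0E:    # SO: G1 into the lower half
            invoked["GL"] = 1
        elif 0x20 <= byte < 0x7F or byte >= 0xA0:
            g = invoked["GL" if byte < 0x80 else "GR"]
            name = designated[g] if g is not None else None
            result.append((name, byte & 0x7F))
        i += 1
    return result
```

With this sketch, the byte E5 is reported as belonging to ISO-IR-100
after "Esc 2D 41, Esc 7E" and to ISO-IR-144 after "Esc 2D 4C, Esc
7E", and to no set at all when no designation precedes it.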

It is also impossible to decide on the true meaning of a byte in a
GeneralText message without scanning the whole message for shift and
escape sequences.

3.2. Recommendation for choosing character sets

RECOMMENDATION:

When the text to be rendered is representable in one of the character
sets of ISO-8859, the G0 set should be set to ISO 646 International
Reference Version (1991), also called US-ASCII, ISO-IR-6.

The older character set ISO-IR-2, ISO 646 IRV(1983), should NOT be
used.

The G1 set should be set to the relevant ISO-8859 part.

G2 and G3 are not used. This corresponds to the first level of ISO
4873 usage.

For the currently defined parts of ISO 8859, the character set
designations are (relative to ISO 8859:1987):

   Part  ISO IR name  Escape sequence  Remarks
                      for G1 use
   1     ISO-IR-100   Esc 2D 41        West Europe
   2     ISO-IR-101   Esc 2D 42        East European
   3     ISO-IR-109   Esc 2D 43
   4     ISO-IR-110   Esc 2D 44
   5     ISO-IR-144   Esc 2D 4C        Cyrillic
   6     ISO-IR-127   Esc 2D 47        Arabic
   7     ISO-IR-126   Esc 2D 46        Greek
   8     ISO-IR-138   Esc 2D 48        Hebrew
   9     ISO-IR-148   Esc 2D 4D        Baltic, Turkish

The G1 set should be permanently shifted into the upper half of the
code page.

When the text is not representable in one of the ISO-8859 character
sets, the following rules may be applied:

   (1) If any Latin characters are used, keep IA5 as the G0 set.

   (2) If a mainstream character set is used (Greek, Cyrillic,
       Hebrew, Arabic), designate this as the G1 character set, and
       permanently shift this into the upper half of the code page
       (LS1R).
       This also applies to multi-byte character sets.

   (3) If occasional extensions to a character set that is basically
       Latin occur (like accents, national variants and so on), and
       these are available in a single character set, designate the
       relevant set as G2 and use single shift (SS2) to invoke
       characters from this character set. The ISO 8859
       supplementary set, ISO-IR-154, is recommended for this
       purpose. This corresponds to the ISO 4873 "second level"
       application.

   (4) If two non-Latin character sets are used, the second should
       be designated as G3, and shifted into the upper half of the
       code page by the use of Locking Shift 3 Right (LS3R). This
       corresponds to the ISO 4873 "third level" application.

   (5) Character sets with floating accents, like ISO 6937, should
       be avoided wherever possible.

   (6) The shifts changing the lower half of the code table (SI/SO,
       LS2 and LS3) should NOT be used.

RATIONALE:

Keeping the G0 set reserved for ASCII ensures that text in ASCII
always has the same bit representation. Using the upper half of the
code page for other scripts ensures that both text in these languages
and text of this type mixed with English can be represented without
the use of shift sequences.

If the language and/or content of a text is completely unknown,
chapter 5 gives an algorithm that may be used to decide upon the
character sets. This might, for instance, be suitable for use at
automatic mail gateways.

4. GUIDELINES FOR THE RENDERING OF GENERALTEXT

As a basic rule, one should NOT assume that any of the rules above
are followed.

A user agent capable of rendering GeneralText should:

   (1) ALWAYS be able to identify and render characters in IA5, no
       matter how they are designated and invoked.

   (2) ALWAYS be able to identify and render characters in the
       "native" character sets, no matter how they are designated
       and invoked.
   (3) ALWAYS indicate the presence of characters that cannot be
       adequately represented on the current output device.

   (4) NEVER render a character in an unknown or unrepresentable
       character set by displaying the character in the same bit
       position in the native character set.

   (5) PREFERABLY be able to identify and render characters that are
       the same as characters in the "native" character sets, even
       though they are designated and invoked as part of other
       character sets. This applies in particular to the "invariant"
       part of ISO 8859, parts 1 through 6.

   (6) PREFERABLY be able to combine the floating accents of ISO
       6937 with their base characters for suitable rendering using
       the capabilities of the current output device.

   (7) PREFERABLY be able to display text both in a mode using
       fallbacks for nonrenderable characters and in a mode
       designating nonrenderable characters as such.

   (8) PREFERABLY be able to save the content of a GeneralText
       message to a file or other suitable media, saving all
       character set information, for later processing by other
       means. It is not illegal to render the character set
       information into a different format; however, it should be
       noted that it is easy to lose vital information if the format
       chosen for representing character sets does not offer the
       possibility of referencing all character sets in the ISO
       registry of character sets.

5. RECOMMENDATION FOR SELECTION OF CHARACTER SETS

5.1. Algorithm for selection of character sets

When one has text in which characters from several character sets
occur, and wants to process this into a GeneralText document, it is
often hard to choose the right character sets.

The following paragraphs give an algorithm that can be started at the
beginning of a message and that, at the end of it, returns a set of
character sets that can be used as G0..G3 character sets, OR an
indication that the task is impossible.
The detailed specifications have been machine-generated from Keld
Jørn Simonsen's character set tables.

VARIABLES:

   UsedSets       The set of character sets that MUST be used for
                  this message.

   UsableSets     The set of character sets that MAY be used for
                  this message. Each set also contains a counter for
                  each character position.

   PossibleSets   The set of all the character sets known to be
                  usable in the destination format.

ALGORITHM:

   1) Add IA5 (ISO-IR-6) to the UsedSets (as G0).

   2) Get the next character of the text. If the text is completely
      analyzed, go to FINISHED.

   3) If it is in the UsedSets, go to 2).

   4) Find the set of character sets from PossibleSets in which the
      character occurs. If it does not occur in any, report failure.

   5) If it is in a single character set in PossibleSets only, add
      this set to UsedSets, and go to 2).

   6) If it is in more than one character set, add these to
      UsableSets (if not already present), and increment the counter
      for that character in all the sets. Go to 2).

   FINISHED)

   1) (FINAL SELECTION) Remove any character set in UsedSets from
      UsableSets. Zero the counters for any character in UsableSets
      that also occurs in UsedSets.

      WHILE (more characters left, that is, some counter is
             non-zero)
         Select one character set and move it from UsableSets to
         UsedSets.
         Zero the counters for all characters in this set in the
         other UsableSets.
      END WHILE

      This step can be "tuned" any way you want, for instance by
      choosing the character sets most likely to be understood at
      the destination first, choosing the character sets covering
      the most characters first, avoiding multi-byte character sets
      as long as possible, or any other scheme suitable for the
      application.

5.2. WHAT TO DO ON FAILURE

Failure will occur in this scheme if a character is found that is not
in the PossibleSets.
It may then be handled in one of the following ways:

   (1) Replace the character with the SUB control character.

   (2) Replace the character with Keld Simonsen Mnemonics. This is a
       reversible transformation as long as the recipient is aware
       that it has been used, but requires passing out-of-band
       information to indicate this.

   (3) Replace the lost characters with any suitable fallback or
       mnemonic scheme intended for human understanding.

   (4) Bounce the message/refuse the conversion/give up.

The action to be taken may differ depending on the percentage of
"lost" characters.

If the message has "controls" like "conversion with loss prohibited",
only the last possibility may be used.

5.3. SUGGESTED GUIDELINES FOR USE OF THE ALGORITHM

There are 2 steps in the algorithm above that are left for local
judgement:

   (1) Selection of the sets to appear in PossibleSets.

   (2) The algorithm for deciding which character set to select in
       the FINAL SELECTION step.
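
The algorithm of 5.1 can be sketched in a few lines of code. This is
an illustration under stated assumptions, not a reference
implementation: the names select_sets, repertoire and POSSIBLE_SETS
are hypothetical, the repertoires are taken from Python's iso8859_N
codecs (which may differ in detail from the 1987-1989 editions of ISO
8859), control characters are skipped as belonging to C0, and the
final selection simply picks sets in list order. Failure is reported
by returning None.

```python
from collections import Counter


def repertoire(codec):
    """Characters found in the upper half of a single-byte codec."""
    return set(bytes(range(0xA0, 0x100)).decode(codec, errors="ignore"))


# PossibleSets: IA5 first, then ISO 8859 parts 1 to 9 by ISO-IR number.
POSSIBLE_SETS = {"ISO-IR-6": {chr(c) for c in range(0x20, 0x7F)}}
for ir_number, part in [(100, 1), (101, 2), (109, 3), (110, 4), (144, 5),
                        (127, 6), (126, 7), (138, 8), (148, 9)]:
    POSSIBLE_SETS["ISO-IR-%d" % ir_number] = repertoire("iso8859_%d" % part)


def select_sets(text, possible=POSSIBLE_SETS):
    """Return the list of sets to use (G0 first), or None on failure."""
    used = ["ISO-IR-6"]   # step 1: IA5 is always G0
    usable = {}           # UsableSets: set name -> per-character counters

    for ch in text:       # steps 2 to 6
        if ch < " " or any(ch in possible[name] for name in used):
            continue
        candidates = [name for name in possible if ch in possible[name]]
        if not candidates:
            return None                    # step 4: failure
        if len(candidates) == 1:
            used.append(candidates[0])     # step 5: the only choice
        else:
            for name in candidates:        # step 6: decide later
                usable.setdefault(name, Counter())[ch] += 1

    def zero_covered(chars):
        for counter in usable.values():
            for ch in counter:
                if ch in chars:
                    counter[ch] = 0

    # FINAL SELECTION
    for name in used:
        usable.pop(name, None)
        zero_covered(possible[name])
    while True:
        pending = [name for name in possible
                   if name in usable and sum(usable[name].values()) > 0]
        if not pending:
            break
        chosen = pending[0]                # tunable: here, list order
        used.append(chosen)
        del usable[chosen]
        zero_covered(possible[chosen])

    # Only G0..G3 are available, so more than four sets is a failure.
    return used if len(used) <= 4 else None
```

For example, select_sets("Blåbær") yields ISO-IR-6 and ISO-IR-100,
because the two non-ASCII letters occur in several of the sets and
ISO-IR-100 comes first in the list, while a Cyrillic text yields
ISO-IR-6 and ISO-IR-144 directly from step 5.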

In the context of generating X.400 GeneralText messages, the
following is suggested:

Sets in PossibleSets:

   ISO-IR-6    Esc 28 42 (G0)  US-ASCII, IA5, ISO646
   ISO-IR-100  Esc 2D 41 (G1)  ISO-8859-1  West Europe
   ISO-IR-101  Esc 2D 42 (G1)  ISO-8859-2  Central/Eastern Europe
   ISO-IR-109  Esc 2D 43 (G1)  ISO-8859-3  Mediterranean countries
   ISO-IR-110  Esc 2D 44 (G1)  ISO-8859-4  "latin 4"
   ISO-IR-144  Esc 2D 4C (G1)  ISO-8859-5  Cyrillic
   ISO-IR-127  Esc 2D 47 (G1)  ISO-8859-6  Arabic
   ISO-IR-126  Esc 2D 46 (G1)  ISO-8859-7  Greek
   ISO-IR-138  Esc 2D 48 (G1)  ISO-8859-8  Hebrew
   ISO-IR-148  Esc 2D 4D (G1)  ISO-8859-9  Baltic/Nordic/Turkish

The following multi-byte character sets are recommended:

   ISO-IR-87   (Japanese JIS C6226-1983)  Esc 24 29 42 (G1)
   ISO-IR-149  (Korean KS C 5601-1989)    Esc 24 29 43 (G1)
   ISO-IR-58   (Chinese GB 2312-80)       Esc 24 29 41 (G1)

It is a STRONG recommendation that character sets not listed above,
which do not add any new characters to the total set of characters
given by the character sets above, should NOT be used in X.400
interchange.

Algorithm for selecting character sets:

Start at the top of the list above, and add each set only if it is
needed.

5.4. Selecting a character set based on language

If the most common language of the environment in which it is used is
known, the following character sets are recommended.

The table of Latin-script languages is based on work by Johan van
Wingen. The others are best guesses by the author.

Again, these are intended for guidance, not enforcement; there is
considerable prestige attached to such recommendations in other
contexts, and it is therefore likely that each language group will
make appropriate decisions on this subject.

The table below is intended as a compilation of existing knowledge,
again on the principle that it is better to say something than to say
nothing.

The language codes come from ISO 639.

Language 1 2 3 4 5 6 7
------------------------------------------------------------
sq Albanian X X X X
eu Basque X X X
br Breton X X
hr Croatian X
cs Czech X
da Danish X
eo Esperanto X X
et Estonian X
fo Faeroese X
fi Finnish X X X X X
fy Frisian X X
?? Gaelic X X
gl Galician X X X
de German X X X
hu Hungarian X
is Icelandic X
ga Irish X X X X
it Italian X X
lv Latvian X X
lt Lithuanian X X
mt Maltese X
no Norwegian X X
pl Polish X
pt Portuguese X
?? Rhaetian X X
ro Romanian X
sk Slovak X
sl Slovenian X X
?? Sorbian X
es Spanish X X X
sv Swedish X X X
tr Turkish X X

Explanation of character set codes
----------------------------------------
   1: ISO_8859-1:1987
   2: ISO_8859-2:1987
   3: ISO_8859-3:1988
   4: ISO_8859-4:1988
   5: ISO_8859-9:1989
   6: ISO_8859-supp
   7: ISO_8859-2:1987 and ISO_8859-supp

Other languages for which appropriate character sets are known are
listed in the table below.

   Language       Character set
   ar Arabic      ISO-8859-6
   el Greek       ISO-8859-7
   iw Hebrew      ISO-8859-8
   ja Japanese    ISO-IR-87 (Japanese JIS C6226-1983)
   ko Korean      ISO-IR-149 (Korean KS C 5601-1989)
   ru Russian     ISO-8859-5
   zh Chinese     ISO-IR-58 (Chinese GB 2312-80)

Additional entries in this table are welcome!

Some languages have only one or a few characters missing. These are
listed below.

   Language         Character set  Missing
      Sami          ISO-8859-4     I with diaeresis
                                   N with acute
   kl Greenlandic   ISO-8859-1     I with tilde
                                   K with cedilla
                                   U with tilde
   cy Welsh         ISO-8859-1     W with acute
                                   W with grave
                                   W with diaeresis
                                   Y with grave
                                   Y with circumflex
   nl Dutch         ISO-8859-1     Ligature IJ
   af Afrikaans     ISO-8859-1     N preceded by apostrophe
   fr French        ISO-8859-1     Ligature OE
   ca Catalan       ISO-8859-1     L with middle dot

Some of these characters may not be in common usage under current
practice.
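
Whether a given text is affected by such gaps can be checked
mechanically. The sketch below is only an illustration: the function
name missing_characters is hypothetical, and it assumes that Python's
iso8859_N codecs are an acceptable stand-in for the corresponding
parts of ISO 8859. It reports, by Unicode name, the characters of a
text that the chosen set cannot represent.

```python
import unicodedata


def missing_characters(text, codec):
    """List the names of characters in text that codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return [unicodedata.name(ch) for ch in missing]
```

For instance, the French word "cœur" checked against iso8859_1
reports the OE ligature as missing, in line with the table above.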

For French, Dutch, Catalan and Afrikaans, the character set ISO
6937-2, which uses floating diacritical marks, may be used.

Languages for which the author does NOT know the proper character set
include:

   aa Afar
   ab Abkhazian
   am Amharic
   as Assamese
   ay Aymara
   az Azerbaijani
   ba Bashkir
   be Byelorussian
   bg Bulgarian
   bh Bihari
   bi Bislama
   bn Bengali
   bo Tibetan
   co Corsican
   dz Bhutani
   fa Persian
   fj Fiji
   gd Scots
   gn Guarani
   gu Gujarati
   ha Hausa
   hi Hindi
   hy Armenian
   ia Interlingua
   ie Interlingue
   ik Inupiak
   in Indonesian
   ji Yiddish
   jw Javanese
   ka Georgian
   kk Kazakh
   km Cambodian
   kn Kannada
   ks Kashmiri
   ku Kurdish
   ky Kirghiz
   la Latin
   ln Lingala
   lo Laothian
   mg Malagasy
   mi Maori
   mk Macedonian
   ml Malayalam
   mn Mongolian
   mo Moldavian
   mr Marathi
   ms Malay
   my Burmese
   na Nauru
   ne Nepali
   oc Occitan
   or Oriya
   pa Punjabi
   ps Pashto
   qu Quechua
   rm Rhaeto
   rn Kirundi
   rw Kinyarwanda
   sa Sanskrit
   sd Sindhi
   sg Sangro
   sh Serbo
   si Singhalese
   sm Samoan
   sn Shona
   so Somali
   sr Serbian
   ss Siswati
   st Sesotho
   su Sudanese
   sw Swahili
   ta Tamil
   te Telugu
   tg Tajik
   th Thai
   ti Tigrinya
   tk Turkmen
   tl Tagalog
   tn Setswana
   to Tonga
   ts Tsonga
   tt Tatar
   tw Twi
   uk Ukrainian
   ur Urdu
   uz Uzbek
   vi Vietnamese
   vo Volapuk
   wo Wolof
   xh Xhosa
   yo Yoruba
   zu Zulu

6. REFERENCES

   [ISO 4873]  <> 1991 revision. Replaces ISO 2022

   [ISO 8859]

   [ISO 6937]

   [ISO 639]