From: James McCann (james@votehere.net)
Date: Thu Feb 22 2001 - 18:10:47 EST
Do you realize I am assuming SAX is an external interface?
>
> No, the internal representation is UTF8. The encoding conversion is
>done before the data is passed to the parser. I don't think it's
>a bug, but a feature :-)
>
I realize the internal representation is UTF8, but if my document is in
ISO-8859-whatever, that is the character set I wish to use and see. The
mechanism for setting the encoding of the input to the callback functions
should be similar to the DOM I/O functions: If there is an encoding set,
then when outputting a tree, that encoding should be used. If there is a
SAX parser, then when calling the callbacks, that encoding should be used.
The encoding should be under the control of the user. The user should
not be responsible for translating encodings unless that is what the user
wishes, and I would also argue that the user should be able to specify a
different encoding to be received by the callback functions.
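
A minimal sketch of the behaviour requested above, assuming a thin wrapper
that sits between the parser and the user's callback. It uses Python's
standard xml.sax module rather than libxml, purely as an illustration; the
names TranscodingHandler, parse_with_encoding and out_encoding are
hypothetical and are not part of any libxml API.

```python
import xml.sax


class TranscodingHandler(xml.sax.ContentHandler):
    """Hypothetical wrapper: re-encodes character data before the user callback."""

    def __init__(self, on_text, out_encoding):
        super().__init__()
        self.on_text = on_text            # user callback, receives bytes
        self.out_encoding = out_encoding  # encoding the user asked to see

    def characters(self, content):
        # The parser hands over decoded text; convert it to the encoding
        # the user requested before invoking the user's callback.
        self.on_text(content.encode(self.out_encoding))


def parse_with_encoding(xml_bytes, out_encoding):
    """Parse xml_bytes and return all character data as bytes in out_encoding."""
    chunks = []
    handler = TranscodingHandler(chunks.append, out_encoding)
    xml.sax.parseString(xml_bytes, handler)
    # characters() may be called several times, so join the pieces.
    return b"".join(chunks)


# A small ISO-8859-1 document containing one non-ASCII character.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")

latin1_text = parse_with_encoding(doc, "iso-8859-1")  # same encoding as the document
utf8_text = parse_with_encoding(doc, "utf-8")         # a different, user-chosen encoding
```

The point of the sketch is only that the conversion happens once, in the
layer that calls the callbacks, so user code never has to transcode by hand
unless it wants to.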
>
> This is the intended behavior. And I'm afraid you would have
>serious troubles trying to change libxml to use only the original
>encoding. Please check
> http://xmlsoft.org/encoding.html
>
> for more explanation on those issues.
I have read this. I am not suggesting keeping the internal representation
in the original encoding, but if a user has a document in a given encoding,
the user may wish to use that encoding; in fact, he most likely does. If
this were not so, you would not have bothered putting the transcoding into
the DOM output functions. I do not understand the rationale for forcing
a user to handle UTF8 if the original XML is encoded in something else.
To quote from the above referenced document:

  Having internationalization support in libxml means the following:
   1. the document is properly parsed
   2. information about its encoding is saved
   3. it can be modified
   4. it can be saved in its original encoding
   5. it can also be saved in another encoding supported by libxml
      (for example straight UTF8 or even an ASCII form)

I argue that point 2 is not followed by the SAX parser if callbacks are
presented with text in an encoding other than the one used by the document,
and that the spirit of points 2, 4 and 5 is broken as well. The principle
seems to be that the user can determine which encoding he wishes to handle.
James McCann