RE: [xml] encoding, SAX callbacks

From: James McCann (james@votehere.net)
Date: Fri Feb 23 2001 - 14:31:29 EST


>
>IMHO, getting the characters in the SAX callbacks in the encoding of the
>document is the worst of all choices. Now the application has to know
>about myriads of encodings, instead of the parser.

I don't understand this thinking. I am working on a large project w/
other people, and our apps use ISO-8859-1. We want to understand and
use precisely one encoding, ISO-8859-1. This is the encoding used in
our XML documents. We do not want to handle UTF8 or any other encoding.
libxml's lack of flexibility in this regard means that we have to add
code (which I of course simply copied from libxml) to convert from
an encoding which we have no desire to use to our desired encoding. One
of the features of libxml which initially attracted me was its ability
to handle multiple encodings. Now I find that, besides using my preferred
encoding, I must also translate from the encoding used internally by
libxml.
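
To make the burden concrete, here is a minimal sketch of the kind of
routine an application ends up carrying just to get ISO-8859-1 back out
of UTF8 callback data. It is not the code from libxml; the name
utf8_to_latin1 and its signature are made up for illustration, and it
assumes each buffer ends on a character boundary.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert UTF-8 input to ISO-8859-1.
 * Returns the number of bytes written to out, or -1 if the input is
 * malformed, contains a code point above U+00FF, or out is too small.
 * Hypothetical helper, not a libxml function.
 */
int utf8_to_latin1(const unsigned char *in, size_t inlen,
                   unsigned char *out, size_t outcap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < inlen) {
        unsigned char c = in[i];
        unsigned char ch;

        if (c < 0x80) {
            ch = c;                 /* ASCII passes through unchanged */
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < inlen &&
                   (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF */
            ch = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            return -1;              /* not representable in ISO-8859-1 */
        }
        if (o >= outcap)
            return -1;              /* output buffer too small */
        out[o++] = ch;
    }
    return (int)o;
}
```

Every SAX callback that receives text would have to pass its data
through something like this before the rest of the application sees it.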

>There would be some value in the possibility to specify the desired
>encoding for the callbacks, if you have a UTF16 or 8859-x centric
>application. But this can easily be implemented on top of the existing SAX
>callback.

I agree that there is "value in the possibility to specify the desired
encoding for the callbacks" but I would drop the qualification. I also
agree that it can easily be implemented, and I propose to do this w/in the
library. libxml already does UTF8 -> various other encodings and vice
versa, so why burden applications w/ the need to write additional code?
I have no problem with using UTF8 as the default encoding for the SAX
callbacks as long as a mechanism exists to select the encoding used
in the callbacks. I realize there will be memory use and performance
drawbacks, but if a user requests a certain encoding, and the appropriate
transcoders exist, that encoding should be used in the callbacks.

>In fact I assume some users already have done so and can share their
>code?!

The appropriate means for sharing code is through a library. Having
someone email some code, or copying and pasting code from the library
into a project, is not. libxml already knows how to translate from one
encoding to another; what I am suggesting is to make this ability
available to the user, who should not be forced to use or deal with UTF8.
There is already a flexible means for transcoding in the library, so why
not extend the API to permit this in SAX callbacks? It would not break
existing code, and would benefit projects like mine which do not use
UTF8 (which I suspect is a large group of applications).

What I propose is something along these lines:

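  /* Create the push parser as usual; the final argument is the filename. */
  pCtxt = xmlCreatePushParserCtxt(&SAXHandler, NULL, "", 0, NULL);
  /* Proposed new call: deliver all callback strings as ISO-8859-1. */
  xmlSetSAXCallbacksEncoding(pCtxt, "ISO-8859-1");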

Now my application only ever needs to understand ISO-8859-1.
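
For comparison, here is a rough sketch of the wrapper an application
would otherwise have to write around its own characters callback to get
a similar effect. It is only an illustration: it uses POSIX iconv rather
than libxml's own transcoders, every name in it (transcoding_ctx,
transcoding_init, transcoding_characters, transcoding_free, capture) is
hypothetical, and it assumes each chunk handed to the callback ends on a
UTF8 character boundary. The callback type mirrors the shape of libxml's
characters callback, with unsigned char standing in for xmlChar so that
the sketch compiles without libxml.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdlib.h>

/* Same shape as libxml's characters callback (ctx, text, length). */
typedef void (*chars_cb)(void *ctx, const unsigned char *ch, int len);

/* Hypothetical wrapper state: a converter plus the application callback. */
typedef struct {
    iconv_t cd;        /* converts UTF-8 to the requested encoding */
    chars_cb user_cb;  /* application callback, receives converted text */
    void *user_ctx;    /* passed through to user_cb untouched */
} transcoding_ctx;

/* Returns 0 on success, -1 if iconv does not support the encoding. */
int transcoding_init(transcoding_ctx *t, const char *encoding,
                     chars_cb cb, void *user_ctx)
{
    assert(t != NULL && encoding != NULL && cb != NULL);
    t->cd = iconv_open(encoding, "UTF-8");
    if (t->cd == (iconv_t)-1)
        return -1;
    t->user_cb = cb;
    t->user_ctx = user_ctx;
    return 0;
}

/* Would be registered as the SAX characters callback, with a
 * transcoding_ctx as the user data. Converts one chunk and forwards it. */
void transcoding_characters(void *ctx, const unsigned char *ch, int len)
{
    transcoding_ctx *t = ctx;
    size_t inleft = (size_t)len;
    size_t outcap = 4 * inleft + 4;   /* generous bound for common targets */
    size_t outleft = outcap;
    char *in = (char *)ch;            /* iconv takes a non-const pointer */
    char *buf = malloc(outcap);
    char *out = buf;

    if (buf == NULL)
        return;
    if (iconv(t->cd, &in, &inleft, &out, &outleft) != (size_t)-1) {
        t->user_cb(t->user_ctx, (const unsigned char *)buf,
                   (int)(outcap - outleft));
    } else {
        /* The sketch drops a chunk that fails to convert and resets
         * the converter to its initial state. */
        iconv(t->cd, NULL, NULL, NULL, NULL);
    }
    free(buf);
}

void transcoding_free(transcoding_ctx *t)
{
    iconv_close(t->cd);
}

/* Example application callback: appends converted bytes to a buffer. */
typedef struct { unsigned char data[64]; int len; } capture;

void capture_characters(void *ctx, const unsigned char *ch, int len)
{
    capture *c = ctx;
    for (int i = 0; i < len && c->len < (int)sizeof c->data; i++)
        c->data[c->len++] = ch[i];
}
```

The same wrapping would be needed for every other callback that carries
text (element names, attribute values, and so on), which is why a single
call inside the library seems the better place for it.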

James McCann

----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net


This archive was generated by hypermail 2b29 : Fri Feb 23 2001 - 14:44:54 EST