Re: [xml] encoding, SAX callbacks


From: Daniel Veillard (Daniel.Veillard@imag.fr)
Date: Fri Feb 23 2001 - 15:05:22 EST


On Fri, Feb 23, 2001 at 11:31:29AM -0800, James McCann wrote:
> >There would be some value in the possibility to specify the desired
> >encoding for the callbacks, if you have a UTF16 or 8859-x centric
> >application. But this can easily be implemented on top of the existing
> >SAX callback.
>
> I agree that there is "value in the possibility to specify the desired
> encoding for the callbacks" but I would drop the qualification. I also
> agree that it can easily be implemented, and I propose to do this w/in the
> library. libxml already does UTF8 -> various other encodings and vice
> versa, so why burden applications w/ the need to write additional code?
> I have no problem with using UTF8 as the default encoding for the SAX
> callbacks as long as a mechanism exists to determine the encoding used
> in the callbacks. I realize there will be memory use and performance
> drawbacks, but if a user requests a certain encoding, and the appropriate
> transcoders exist, that encoding should be used in the callbacks.

  Let's be very clear.
  You ask for ISO Latin 1, someone else could ask for Shift JIS callbacks
or UTF16 ones. In either case I would have to seriously complicate
the libxml parser design to adapt dynamically to the fact that it should
look up the tags and delimiters using X, Y or Z encodings.
  No way, I can't do this. So I convert before parsing. Hence I have
a single encoding to understand in the parser. Now once the parser
has decoded a string, converting back to another encoding is brain-dead
simple. It's already in a canonical form.
  You don't use the DOM, limiting yourself to SAX. Try to find another
parser with a SAX only interface and not needing conversion of its
input. Maybe expat does, libxml won't. Though if you have some
time, prepare a patch and send it, maybe it will be accepted if it's
clean and doesn't look like a maintenance nightmare.

> Now my application only ever needs to understand ISO-8859-1.

  Right now it only needs to understand UTF8 instead, and this
completely independently of the encoding of the input.
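
  To give an idea of what converting back on top of the existing SAX
callbacks could look like for the ISO Latin 1 case, here is a minimal
sketch. The helper name utf8_to_latin1 is hypothetical and not part of
libxml; it assumes the callback hands over a UTF8 buffer plus a length,
and it simply fails on anything that has no Latin 1 equivalent.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of libxml: convert a UTF-8 buffer, as
 * delivered by a SAX callback, to ISO-8859-1.
 *
 * in/inlen:   UTF-8 bytes (not necessarily NUL-terminated)
 * out/outlen: destination buffer and its capacity in bytes
 *
 * Returns the number of bytes written, or -1 if the input is malformed,
 * holds a character outside Latin 1 (above U+00FF), or does not fit.
 */
static int utf8_to_latin1(const unsigned char *in, size_t inlen,
                          unsigned char *out, size_t outlen)
{
    size_t i = 0, o = 0;

    while (i < inlen) {
        unsigned char c = in[i];
        unsigned int cp;

        if (c < 0x80) {
            /* ASCII: identical in both encodings */
            cp = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < inlen &&
                   (in[i + 1] & 0xC0) == 0x80) {
            /* 2-byte sequence encoding U+0080..U+00FF */
            cp = ((unsigned int)(c & 0x1F) << 6) | (in[i + 1] & 0x3F);
            i += 2;
        } else {
            /* malformed input, or a character Latin 1 cannot hold */
            return -1;
        }

        if (o >= outlen)
            return -1;          /* destination buffer too small */
        out[o++] = (unsigned char)cp;
    }
    return (int)o;
}
```

  An application would call something like this from its own characters
callback on the (buffer, length) pair it receives, before handing the
text to the rest of its Latin 1 code. The transcoders already present in
libxml could presumably be used instead of a hand-written loop; the point
is only that the conversion lives in the application's callback, not in
the parser.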

Daniel

-- 
Daniel Veillard      | Red Hat Network http://redhat.com/products/network/
veillard@redhat.com  | libxml Gnome XML toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net



This archive was generated by hypermail 2b29 : Fri Feb 23 2001 - 16:44:45 EST