Re: [xml] Don't read XML from directories.


From: Alejandro Forero Cuervo (bachue@bachue.com)
Date: Wed Jan 24 2001 - 00:37:31 EST


    
Igor raises an interesting question: should libxml check whether a
filename points to a directory before actually opening it and
attempting to parse it? My reasoning is that, even though there are
some valid points as to why it should not, it should. In this message
I'll explain why.

My setup is as follows. I guess it is pretty common. My program asks
the user to pick a file (showing an appropriate file/URL dialog) and
immediately asks libxml to parse it. The following options are
possible:

1. I don't check to see if the user is attempting to open a
   directory; I still try to read XML from it. Rather than seeing an
   appropriate message telling him he picked a directory, he sees
   messages telling him that the contents of the directory are not a
   valid XML document.

2. Before calling the functions in libxml, my program makes sure the
   user did not pick a directory. If he did, boom, it signals an
   error. If he didn't, it calls libxml's parsing functions.

3. The checks are added inside libxml, right next to the open(2)
   calls.
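
A minimal sketch of what the check in option 2 could look like on the
caller's side, assuming a POSIX system with stat(2). The helper name
path_is_directory is made up for illustration and is not part of
libxml.

```c
#include <sys/stat.h>

/* Hypothetical caller-side helper (not a libxml function).
 * Returns 1 if path names a directory, 0 if it names anything else,
 * and -1 if stat(2) itself fails (errno is left as set by stat). */
int path_is_directory(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

A caller would test the result before handing the name to
xmlParseFile and show its own "you picked a directory" message when
the helper returns 1.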

One might argue for the first option, stating that it will make the
program more flexible than the others. Sure, if the user does tell us
to try to read XML from a directory, however absurd it might sound to
us, the programmers, let's not assume he made a mistake and let's
actually do what he told us. In reality, there will hardly ever be
any motivation to open a directory and parse it as XML, while it is a
common user mistake to select a directory (in the file-selection
dialog I'm using) rather than a file. Let's not waste much time
talking about this option and discard it already.

Arguments for the second option exist as well. Igor states that
"Libraries like libxml (and libxml2) should not go any deeper into the
matter of checking if a passed parameter is sane or not." Having
every different library repeat the same check all over again is one of
the problems he states.

Besides, if the programmer tells libxml to actually parse XML from a
directory, libxml should not try to be smarter ("Hey, BonzoProgrammer,
that's an error, you crap!") but rather do exactly what the programmer
wanted, however absurd. Like Igor says, the library shouldn't act as
a shield against the programmer's own mistakes. And, again, there is
flexibility.

But I said that in this message I would argue for the 3rd option. So
enough.

First, I don't agree with the strlen example and I think Igor abuses
it. I could talk about other examples where functions in libc (at
least in glibc's implementation) do make those checks against NULL
pointers (such as printf("%s", NULL) or free(NULL)). Also (though you
might just ignore this argument if you don't agree with it), libxml's
xmlParseFile and strlen are completely different in that the former is
a high-level function that is supposed to be called once or twice
during the whole program execution (where adding a few microseconds of
delay does not matter), while strlen is a low-level function that must
be -fast-.

    I am simply against the library acting as a shield before
    programmer's own mistakes. What would be next? Patching strlen
    routine in glibc so it checks for NULL argument before parsing the
    string and, if it is NULL, sends an email to
    strlenwatch@localhost, sets the appropriate errno and returns -1?

    Correctness and sanity of parameters any program passes to any
    library function must be enforced by the program itself. No
    library can ever know better what the program is about to do.

I agree completely with your second paragraph.

However, you see, correctness and sanity of parameters are completely
relative to the function's precondition.

Following your logic, why bother checking the return value for the
open(2) call in libxml at all? After all, the programmer passed a
filename that can't be opened so it's his fault for not establishing
the correctness and sanity of parameters.

Would you suggest that everyone who calls functions in libxml to parse
a file had better call open(2) on the file first to make sure it can be
opened, or else the unpredictable will happen?

Currently, the (implicit) precondition is "pass a filename that points
to a file (not a directory; if it points to a directory, errors will
occur)". So yes, for that precondition, you must make sure your
filename does point to a file, not a directory. I am just proposing
changing that (again, implicit) precondition to "pass a filename that
can point to either a file or a directory and let libxml signal an
error if it is a directory".

There is more to this, however.

The functions in libxml receive a ``filename'' but hey, it can be a
URL too. Before actually calling open on ``filename'', libxml checks
it to see if it is an HTTP URL, an FTP URL or something that must be
passed to open(2). And, of course, it only gets open(2)ed if it isn't
a URL. So doing things your way, with these checks in the caller
programs, everybody and their dog would have to copy and paste all the
code that, given the ``filename'', checks to see if it is an HTTP URL,
an FTP URL or, finally, something that will get passed to open. Only
then, when the caller program notices it is something that will get
passed to open(2), can it call stat.
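
To make the duplication concrete, here is a rough sketch of the logic
every caller would have to carry around. It is an illustration only,
not libxml's actual code: the enum, the function names and the simple
case-sensitive prefix tests are all assumptions.

```c
#include <string.h>
#include <sys/stat.h>

/* Hypothetical classification of a "filename" argument. */
enum resource_kind {
    RES_HTTP,       /* looks like an HTTP URL */
    RES_FTP,        /* looks like an FTP URL */
    RES_FILE,       /* local path that is not a directory */
    RES_DIRECTORY,  /* local path that is a directory */
    RES_UNKNOWN     /* local path that stat(2) could not examine */
};

static int has_prefix(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Only names that are not URLs ever reach stat(2), mirroring the
 * order of checks described above. */
enum resource_kind classify_resource(const char *name)
{
    struct stat st;

    if (has_prefix(name, "http://"))
        return RES_HTTP;
    if (has_prefix(name, "ftp://"))
        return RES_FTP;
    if (stat(name, &st) != 0)
        return RES_UNKNOWN;
    return S_ISDIR(st.st_mode) ? RES_DIRECTORY : RES_FILE;
}
```

If the library ever learns a new URL scheme, every copy of this
function in every caller has to be updated along with it.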

What would be next? Patching all libxml caller programs so they
contain a modified version of all the code in libxml such that, before
actually calling functions in libxml, they know for sure nothing will
fail inside them and, in case they find errors (hey, the contents of
this URL that libxml will eventually download looking for XML does not
parse correctly), log them calling syslog(3) and family, send an email
to couldntestablishcorrectnessinparametersforlibxml@localhost and
call raise(SIGSEGV)?

Well, even if we did this, libxml is supposed to evolve, so its checks
might change and the programs' copies of them would get out of sync.

The reasoning behind having libxml accept URLs as well as filenames
and handling it all automatically is (I'm guessing) to allow the
programmer to forget about this (which is the whole purpose of
libraries). The programmer just gets a resource descriptor of some
kind and libxml will take care of finding out what it is (file, FTP
URL, HTTP URL...) and how to read it. I fail to see why it would be
wrong to add a call to stat(2) to check if ``filename'' is a file or a
directory in a function that is already examining if ``filename'' is a
path or a URL.
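
For what it's worth, a library-side check along these lines might look
roughly like the sketch below. This is not the patch and not libxml
code: the function name is invented, and it uses fstat(2) on the
already-open descriptor instead of stat(2) on the name, which is one
possible variation that avoids looking the path up twice.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical wrapper around open(2) for the local-file case.
 * Returns a readable descriptor, or -1 with errno set. A directory
 * is reported as EISDIR instead of being handed to the parser. */
int open_unless_directory(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;               /* errno already set by open(2) */

    if (fstat(fd, &st) != 0) {
        int saved = errno;       /* close(2) may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }

    if (S_ISDIR(st.st_mode)) {
        close(fd);
        errno = EISDIR;          /* set after close so it survives */
        return -1;
    }

    return fd;
}
```

The caller then sees a single, ordinary failure from the open step and
can report it the same way it reports a missing file.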

But I still need to make that patch. :)

Thanks.

Alejo.
http://bachue.com/alejo

--
The mere formulation of a problem is far more essential than its solution.
      -- Albert Einstein.

$0='!/sfldbi!yjoV0msfQ!sfiupob!utvK'x44;print map{("\e[7m \e[0m",chr ord (chop$0)-1)[$_].("\n")[++$i%77]}split//,unpack'B*',pack'H*',($F='F'x19). "F0F3E0607879CC1E0F0F339F3FF399C666733333CCF87F99E6133999999E67CFFCCF3". "219CC1CCC033E7E660198CCE4E66798303873CCE60F3387$F"#Don't you love Perl?

----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail majordomo@rpmfind.net



This archive was generated by hypermail 2b29 : Wed Jan 24 2001 - 04:43:34 EST