From: Paul D. Smith (pausmith@nortelnetworks.com)
Date: Thu Jan 25 2001 - 18:31:48 EST
Unless I'm missing something very basic, the uri.c:xmlNormalizeURIPath()
function in libxml is very broken. This is with the latest code in CVS as
of about 10 minutes ago (after 2.2.12 was announced).
I noticed this when I had a relative path in an XInclude: although
the relative path was correct, libxml kept insisting that the file
didn't exist (printing a different path than the one I gave it).
Tracing through, I discovered this function has a number of problems.
Here is a tiny test program I wrote to test different behaviors of the
function against the RFC 2396 spec:
-------------------------------------------------------------------------------
#include <stdio.h>
#include <string.h> /* strcpy() */
int xmlNormalizeURIPath(char *);
const char *tests[] =
{
/* Test xxx/.. removal */
"/foo/../bar",
"foo/../bar",
"./foo/../bar",
"foo/./../bar",
"foo/bar/.././../baz",
"foo/..",
"foo/bar/..",
/* Test ./ removal */
"./foo",
"././foo",
".././foo./",
".././foo/.",
/* Test some OK things */
"/foo",
"../foo",
"../../foo",
"../../../foo",
NULL
};
int
main(int argc, char *argv[])
{
char buf[1024];
const char **tpp;
for (tpp=tests; *tpp; ++tpp) {
strcpy(buf, *tpp);
if (xmlNormalizeURIPath(buf))
fprintf(stderr, "Error normalizing `%s'!\n", *tpp);
else
fprintf(stdout, "`%s' => `%s'\n", *tpp, buf);
}
return 0;
}
-------------------------------------------------------------------------------
Linking this with libxml built from CVS, I get the output below. Each
line is annotated with my reading of RFC 2396:
`/foo/../bar' => `/bar' # ok
`foo/../bar' => `foo/../bar' # wrong
`./foo/../bar' => `./bar' # wrong
`foo/./../bar' => `foo/../bar' # wrong
`foo/bar/.././../baz' => `/baz' # wrong
`foo/..' => `foo/..' # wrong
`foo/bar/..' => `foo/../baz' # way wrong! whoa! buffer overrun?
`./foo' => `./foo' # wrong
`././foo' => `./foo' # wrong
`.././foo./' => `../foo././' # whoa again! buffer overrun?
`.././foo/.' => `../foo/' # ok ... ? (see below)
`/foo' => `/foo' # ok
`../foo' => `../foo' # ok
`../../foo' => `../../foo' # ok
`../../../foo' => `../foo' # wrong
I rewrote this function; I include the new version below in its entirety
rather than a patch since the function is very different (and it's all I
changed). There are lots of comments there. With the version below, I
get this output which I think is correct in all cases:
`/foo/../bar' => `/bar'
`foo/../bar' => `bar'
`./foo/../bar' => `bar'
`foo/./../bar' => `bar'
`foo/bar/.././../baz' => `baz'
`foo/..' => `'
`foo/bar/..' => `foo/'
`./foo' => `foo'
`././foo' => `foo'
`.././foo./' => `../foo./'
`.././foo/.' => `../foo/'
`/foo' => `/foo'
`../foo' => `../foo'
`../../foo' => `../../foo'
`../../../foo' => `../../../foo'
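To make the table above checkable by machine rather than by eye, here is
a rough, self-checking sketch. It is an assumption-laden illustration and
not the function posted in this message: ref_normalize() is a
hypothetical, independently written stack-style normalizer, and
count_mismatches() is a hypothetical driver that compares its results
with the expected strings listed above. Like the posted function, the
sketch discards ".." segments that would climb above the root of an
absolute path.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical reference normalizer (NOT the function from this message).
 * Rewrites "path" in place; the result is never longer than the input. */
void ref_normalize(char *path)
{
    char *in = path;    /* next unread input character */
    char *out = path;   /* next output position (always <= in) */
    char *floor;        /* output before this point is never popped */
    int absolute = (path[0] == '/');

    if (absolute) {     /* keep the leading "/" */
        ++in;
        ++out;
    }
    floor = out;

    while (*in != '\0') {
        char *end = in;
        size_t n;
        int last, dotdot;

        while (*end != '\0' && *end != '/')
            ++end;
        n = (size_t)(end - in);
        last = (*end == '\0');
        dotdot = (n == 2 && in[0] == '.' && in[1] == '.');

        if (n == 1 && in[0] == '.') {
            /* (c)/(d): a "." segment is dropped */
        } else if (dotdot && out > floor) {
            /* (e)/(f): pop the previously kept segment and its "/" */
            --out;
            while (out > floor && out[-1] != '/')
                --out;
        } else if (dotdot && absolute) {
            /* (g): ".." above the root of an absolute path is discarded */
        } else {
            memmove(out, in, n);   /* regions may overlap */
            out += n;
            if (!last)
                *out++ = '/';
            if (dotdot)
                floor = out;       /* a kept leading ".." is never popped */
        }
        in = last ? end : end + 1;
    }
    *out = '\0';
}

struct pair { const char *in; const char *want; };

/* Inputs and expected results copied from the table above. */
static const struct pair cases[] = {
    { "/foo/../bar", "/bar" },      { "foo/../bar", "bar" },
    { "./foo/../bar", "bar" },      { "foo/./../bar", "bar" },
    { "foo/bar/.././../baz", "baz" }, { "foo/..", "" },
    { "foo/bar/..", "foo/" },       { "./foo", "foo" },
    { "././foo", "foo" },           { ".././foo./", "../foo./" },
    { ".././foo/.", "../foo/" },    { "/foo", "/foo" },
    { "../foo", "../foo" },         { "../../foo", "../../foo" },
    { "../../../foo", "../../../foo" }
};

/* Returns the number of rows whose result differs from the expected text. */
int count_mismatches(void)
{
    char buf[64];
    size_t i;
    int bad = 0;

    for (i = 0; i < sizeof cases / sizeof cases[0]; ++i) {
        strcpy(buf, cases[i].in);
        ref_normalize(buf);
        if (strcmp(buf, cases[i].want) != 0) {
            fprintf(stderr, "`%s' => `%s' (wanted `%s')\n",
                    cases[i].in, buf, cases[i].want);
            ++bad;
        }
    }
    return bad;
}
```

Calling count_mismatches() from a main() and returning its value gives a
pass/fail exit status. The same table could be pointed at any other
normalizer with the same in-place calling convention.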
I've obeyed the spec as best I could, but there are two areas where the
results seem contrary to what I'd expect:
foo/bar/.. expands to "foo", or "foo/"?
foo/. expands to "foo", or "foo/"?
The spec seems to imply that you keep the trailing "/", so that's what
the code does, although my instinct would tell me to remove it. I guess
it depends on the definition of "complete path segment" as used by this
algorithm.
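One way to see why the trailing "/" matters: RFC 2396 Section 5.2,
steps 6.a and 6.b, build the buffer by keeping the base path up to and
including its last "/" and then appending the reference. The sketch
below covers only those two steps; merge_paths() is a hypothetical
helper name, not something in libxml. If a normalized path is later used
as a base, "foo/" keeps later references inside foo, while "foo" makes
them siblings of foo.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: RFC 2396 section 5.2 steps 6.a and 6.b only.
 * Copies the base path up to and including its last "/" into "buf",
 * then appends the relative reference "ref". No normalization is done. */
void merge_paths(char *buf, size_t size, const char *base, const char *ref)
{
    const char *slash = strrchr(base, '/');
    size_t keep = (slash != NULL) ? (size_t)(slash - base) + 1 : 0;

    snprintf(buf, size, "%.*s%s", (int)keep, base, ref);
}
```

With base "foo/" and reference "x" the buffer becomes "foo/x"; with base
"foo" it becomes just "x".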
I also tried to make sure it would handle multiple adjacent "/"
characters (like "foo/////bar"), although I didn't actually test this.
I don't know whether multiple slashes can still be present by the time
you get to this algorithm, but the algorithm in the RFC doesn't say
what to do with them.
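If a caller decides that runs of slashes should be treated as a single
separator, one option is a small pre-pass like the sketch below. This is
purely an assumption for illustration: the RFC does not call for it,
collapse_slashes() is a hypothetical name, and collapsing empty segments
changes the path text, so it may not be appropriate for every URI.

```c
/* Hypothetical pre-pass: rewrite "path" in place so that every run of
 * adjacent "/" characters becomes a single "/". */
void collapse_slashes(char *path)
{
    char *in = path;
    char *out = path;

    while (*in != '\0') {
        *out++ = *in;
        if (*in == '/') {
            while (*in == '/')   /* skip the rest of the run */
                ++in;
        } else {
            ++in;
        }
    }
    *out = '\0';
}
```

For example, "foo/////bar" becomes "foo/bar".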
Anyway, here it is. It fixed my problem; hopefully it'll help others.
-------------------------------------------------------------------------------
/**
* xmlNormalizeURIPath:
* @path: pointer to the path string
*
* Applies the 5 normalization steps to a path string--that is, RFC 2396
* Section 5.2, steps 6.c through 6.g.
*
* Normalization occurs directly on the string, no new allocation is done
*
* Returns 0 on success, or -1 if @path is NULL
*/
int
xmlNormalizeURIPath(char *path) {
char *cur, *out;
if (path == NULL)
return(-1);
/* Skip all initial "/" chars. We want to get to the beginning of the
* first non-empty segment.
*/
cur = path;
while (cur[0] == '/')
++cur;
if (cur[0] == '\0')
return(0);
/* Keep everything we've seen so far. */
out = cur;
/*
* Analyze each segment in sequence for cases (c) and (d).
*/
while (cur[0] != '\0') {
/*
* c) All occurrences of "./", where "." is a complete path segment,
* are removed from the buffer string.
*/
if ((cur[0] == '.') && (cur[1] == '/')) {
cur += 2;
continue;
}
/*
* d) If the buffer string ends with "." as a complete path segment,
* that "." is removed.
*/
if ((cur[0] == '.') && (cur[1] == '\0'))
break;
/* Otherwise keep the segment. */
while (cur[0] != '/') {
if (cur[0] == '\0')
goto done_cd;
(out++)[0] = (cur++)[0];
}
(out++)[0] = (cur++)[0];
}
done_cd:
out[0] = '\0';
/* Reset to the beginning of the first segment for the next sequence. */
cur = path;
while (cur[0] == '/')
++cur;
if (cur[0] == '\0')
return(0);
/*
* Analyze each segment in sequence for cases (e) and (f).
*
* e) All occurrences of "<segment>/../", where <segment> is a
* complete path segment not equal to "..", are removed from the
* buffer string. Removal of these path segments is performed
* iteratively, removing the leftmost matching pattern on each
* iteration, until no matching pattern remains.
*
* f) If the buffer string ends with "<segment>/..", where <segment>
* is a complete path segment not equal to "..", that
* "<segment>/.." is removed.
*
* To satisfy the "iterative" clause in (e), we need to collapse the
* string every time we find something that needs to be removed. Thus,
* we don't need to keep two pointers into the string: we only need a
* "current position" pointer.
*/
while (1) {
char *segp;
/* At the beginning of each iteration of this loop, "cur" points to
* the first character of the segment we want to examine.
*/
/* Find the end of the current segment. */
segp = cur;
while ((segp[0] != '/') && (segp[0] != '\0'))
++segp;
/* If this is the last segment, we're done (we need at least two
* segments to meet the criteria for the (e) and (f) cases).
*/
if (segp[0] == '\0')
break;
/* If this segment is "..", or if the next segment _isn't_ "..",
* keep this segment and try the next one.
*/
++segp;
if (((cur[0] == '.') && (cur[1] == '.') && (segp == cur+3))
|| ((segp[0] != '.') || (segp[1] != '.')
|| ((segp[2] != '/') && (segp[2] != '\0')))) {
cur = segp;
continue;
}
/* If we get here, remove this segment and the next one and back up
* to the previous segment (if there is one), to implement the
* "iteratively" clause. It's pretty much impossible to back up
* while maintaining two pointers into the buffer, so just compact
* the whole buffer now.
*/
/* If this is the end of the buffer, we're done. */
if (segp[2] == '\0') {
cur[0] = '\0';
break;
}
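/* The source and destination overlap, so use memmove() rather than
* strcpy(), whose behavior is undefined for overlapping strings.
*/
memmove(cur, segp + 3, strlen(segp + 3) + 1);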
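/* If there are no previous segments, then keep going from here.
* Test the character before "segp" instead of decrementing first, so
* that a one-character previous segment (the "a" in "a/b/../..") is
* still found.
*/
segp = cur;
while ((segp > path) && (segp[-1] == '/'))
--segp;
if (segp == path)
continue;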
/* "segp" marks the end of a previous segment; find its
* start. We need to back up to the previous segment and start
* over with that to handle things like "foo/bar/../..". If we
* don't do this, then on the first pass we'll remove the "bar/..",
* but be pointing at the second ".." so we won't realize we can also
* remove the "foo/..".
*/
cur = segp;
while ((cur > path) && (cur[-1] != '/'))
--cur;
}
out[0] = '\0';
/*
* g) If the resulting buffer string still begins with one or more
* complete path segments of "..", then the reference is
* considered to be in error. Implementations may handle this
* error by retaining these components in the resolved path (i.e.,
* treating them as part of the final URI), by removing them from
* the resolved path (i.e., discarding relative levels above the
* root), or by avoiding traversal of the reference.
*
* We discard them from the final path, but only when the path is
* absolute; a relative path keeps its leading ".." segments.
*/
if (path[0] == '/') {
cur = path;
while ((cur[1] == '.') && (cur[2] == '.')
&& ((cur[3] == '/') || (cur[3] == '\0')))
cur += 3;
if (cur != path) {
out = path;
while (cur[0] != '\0')
(out++)[0] = (cur++)[0];
out[0] = 0;
}
}
return(0);
}
-------------------------------------------------------------------------------
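A side note on the in-place compaction used for cases (e) and (f): the
tail of the buffer is copied onto an earlier part of the same buffer, so
source and destination overlap. The C standard leaves strcpy() undefined
for overlapping objects, while memmove() is defined for them. A minimal
sketch, where cut_span() is a hypothetical helper name:

```c
#include <string.h>

/* Hypothetical helper: delete the "len" bytes starting at "at" from a
 * NUL-terminated string by sliding the tail (and its terminator) left.
 * The caller must ensure that at least "len" bytes remain at "at". */
void cut_span(char *at, size_t len)
{
    char *tail = at + len;

    memmove(at, tail, strlen(tail) + 1);   /* +1 copies the '\0' too */
}
```

For a buffer holding "foo/bar/../baz", cut_span(buf + 4, 7) removes
"bar/../" and leaves "foo/baz".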
--
-------------------------------------------------------------------------------
Paul D. Smith <psmith@baynetworks.com> HASMAT--HA Software Methods & Tools
"Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
These are my opinions---Nortel Networks takes no responsibility for them.