From: Luca Padovani (luca.padovani@cs.unibo.it)
Date: Mon Mar 05 2001 - 14:59:32 EST
some alternative subjects for this mail:
- structural validation of XML files using SAX in libxml
- semantic parsing of XML documents
- specific-markup parser generator with libxml
Hi all,
I've been using libxml for more than a year and I think it is a really
great tool. In that time I found myself parsing more than a dozen
different markup languages, and each time I had to rewrite several
"semantic actions" to rebuild the internal data structures that the XML
was representing. This is a very boring and repetitive task. Let us take
a look at the two alternatives:
a) DOM. Yeah, if you want to validate an XML file with libxml you must
build the whole tree. In return you get a recursive structure, and it
is pretty easy to do the right traversal. On the other hand, there are
blank nodes to skip, your code relies on validation for the correct
positioning of node children, and it is a bit boring to scan over node
lists.
b) SAX. No additional memory is required for this parsing, but you have
no validation and the parsing proceeds in a sort of top-down fashion,
which is not what you want in some (most?) cases. In an attempt to make
SAX parsing easier I implemented a kind of run-time environment that
associates names with values, with a little support for run-time
type-checking, but the approach was error-prone and resulted in using
dozens of C macros to access all the environment variables I needed.
I was looking for an approach to the easy (bottom-up) parsing of an XML
document that still allows semantic actions to be associated with each
stage of the parsing. Code reusability, code readability and ease of
implementation were some of my goals. I had YACC in mind, and the
approach I am currently investigating is using YACC as the parser, and
libxml SAX as the lexer.
In other words, SAX events coming from libxml are translated into tokens
for a user-supplied grammar specified in a YACC-like syntax. Semantic
actions can be associated with grammar reductions, and you can use the
mechanism of semantic values provided by YACC to propagate parsed
results in a bottom-up manner.
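To make the "SAX as lexer" idea concrete, here is a minimal sketch in plain C of how events might be buffered as tokens and handed out one at a time. It is not YAXP's actual code: the names TokenQueue, on_start_element, on_end_element, on_characters and next_token are hypothetical, the callbacks are simplified stand-ins for SAX handlers, and the three generic token kinds are an assumption (a real grammar would more likely use one token code per element name).

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 64

/* Hypothetical token kinds: one per SAX event type, plus end of input. */
typedef enum { TOK_END = 0, TOK_OPEN, TOK_CLOSE, TOK_TEXT } TokenKind;

typedef struct {
    TokenKind kind;
    const char *text;   /* element name or character data (not copied) */
} Token;

/* Fixed-size ring buffer sitting between the SAX callbacks and the parser. */
typedef struct {
    Token items[QUEUE_CAP];
    size_t head;
    size_t count;
} TokenQueue;

/* Append a token; returns 0 on success, -1 if the queue is full. */
static int queue_push(TokenQueue *q, TokenKind kind, const char *text)
{
    assert(q != NULL);
    if (q->count == QUEUE_CAP)
        return -1;
    size_t tail = (q->head + q->count) % QUEUE_CAP;
    q->items[tail].kind = kind;
    q->items[tail].text = text;
    q->count++;
    return 0;
}

/* Simplified SAX-style callbacks: each event becomes exactly one token. */
static void on_start_element(void *ctx, const char *name)
{
    queue_push(ctx, TOK_OPEN, name);
}

static void on_end_element(void *ctx, const char *name)
{
    queue_push(ctx, TOK_CLOSE, name);
}

static void on_characters(void *ctx, const char *text)
{
    queue_push(ctx, TOK_TEXT, text);
}

/* What a yylex-like function would call: pop the oldest token,
 * or report TOK_END when nothing is buffered. */
static Token next_token(TokenQueue *q)
{
    Token t = { TOK_END, NULL };
    assert(q != NULL);
    if (q->count == 0)
        return t;
    t = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return t;
}
```

With this sketch, an element such as <constant>42</constant> would be delivered as the three tokens OPEN("constant"), TEXT("42"), CLOSE("constant"). Since SAX pushes events while a YACC parser pulls tokens, some buffer like this (or feeding the document to the XML parser in chunks) is needed between the two; the fixed capacity here is only for brevity.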
The following is a trivial example of a YAXP grammar:

%{
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atoi */
%}

%union {
  gint n;
}

%type <n> expr
%type <n> <expr>
%type <n> <constant>

%%

main:
    expr                { printf("result: %d\n", $1); }
  ;

expr:
    <expr>
  | <constant>
  | expr <plus> expr    { $$ = $1 + $3; }
  | expr <minus> expr   { $$ = $1 - $3; }
  | <plus> expr         { $$ = $2; }
  | <minus> expr        { $$ = -$2; }
  ;

<expr>:
    expr
  ;

<constant>:
    TEXT                { $$ = atoi(YAXP_GET_TEXT($1)); }
  ;

<plus>:
  ;

<minus>:
  ;

%%
The grammar is made of rules, and rules are made of components.
Components surrounded by < > are XML elements; the others are
non-terminal symbols, as in YACC. Thus, you can see from the grammar
above that <plus> and <minus> are EMPTY XML elements, <constant> is an
XML element containing just character data, and <expr> is an element
whose content is described by the non-terminal expr.
Semantic actions are surrounded by { } and semantic values are
accessed with the same notation as in YACC. There are some facilities
for attributes. For example, in a rule of the form:

<table>:
    bla bla   { printf("%s\n", @border); }

you can access the attribute "border" of the element table when it is
parsed.
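As an illustration of what an expression like @border might boil down to, here is a small C sketch of an attribute lookup. This is a guess at the mechanism, not YAXP's actual expansion: find_attribute is a hypothetical helper, and it assumes attributes arrive the way a libxml SAX startElement handler receives them, as a NULL-terminated array of alternating names and values (plain char is used here instead of xmlChar to keep the example self-contained).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look up an attribute by name in a NULL-terminated array laid out as
 * name0, value0, name1, value1, ..., NULL.
 * Returns the value, or NULL if the attribute is absent or the element
 * has no attributes at all (atts == NULL). */
static const char *find_attribute(const char **atts, const char *name)
{
    assert(name != NULL);
    if (atts == NULL)
        return NULL;
    for (size_t i = 0; atts[i] != NULL && atts[i + 1] != NULL; i += 2) {
        if (strcmp(atts[i], name) == 0)
            return atts[i + 1];
    }
    return NULL;
}
```

Under this assumption a semantic action has to be ready for a NULL result when the attribute is missing, since passing NULL to printf with %s is undefined behaviour in C.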
The implementation of all this is fairly easy: I have a little
processor, yaxp, which translates a YAXP grammar into a YACC file. Then,
you have to process the resulting grammar with YACC, compile and link
with a little library for using libxml as "yylex".
Well, I tried to sum up what I can do with YAXP right now. The question
is: does anyone else find this useful too? Are you interested in
using/developing/extending YAXP? If so, please let me know and I'll make
a public release of the source code in a few days.
Any ideas, criticisms, suggestions? (I have many.) They are all welcome.
Best regards,
luca
----
Message from the list xml@rpmfind.net
Archived at: http://xmlsoft.org/messages/
To unsubscribe: echo "unsubscribe xml" | mail majordomo@rpmfind.net