Bundling libxml2 and expat compatibility layer

22 years ago by Christian Stocker

Dmitri

xmlParseFile builds a DOM tree out of your XML document, which is of
course slower than the SAX parsing expat does.

But libxml2 can also parse your XML document SAX-style, without
building a DOM tree. Without having looked at Sterling's code, I assume
that's what he did, so you should compare that code to expat, not
xmlParseFile.
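To make the distinction concrete, here is a minimal sketch. It uses Python's standard library as a stand-in, not the libxml2 or expat C APIs discussed in this thread, so it illustrates the two parsing styles only and says nothing about the speed of either C library. The names CountingHandler, count_sax and count_dom are made up for this example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample input, not taken from the thread.
XML = b"<root><item id='1'>a</item><item id='2'>b</item></root>"


class CountingHandler(xml.sax.ContentHandler):
    """SAX style: react to events as they arrive and keep only a counter."""

    def __init__(self):
        super().__init__()
        self.elements = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; no node object is stored.
        self.elements += 1


def count_sax(data):
    """Count elements by streaming over the document; no tree is built."""
    handler = CountingHandler()
    xml.sax.parseString(data, handler)
    return handler.elements


def count_dom(data):
    """Count elements by first building the whole tree in memory."""
    doc = minidom.parseString(data)
    try:
        return len(doc.getElementsByTagName("*"))
    finally:
        doc.unlink()  # release the tree explicitly once we are done


if __name__ == "__main__":
    print(count_sax(XML), count_dom(XML))
```

Both functions report the same count; the difference is that the DOM variant has to allocate, keep and later release one object per node, which is the extra work being argued about here.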

libxml2 is certainly not slow; it has a well-known reputation for being
very fast for what it does.

And if you don't know the difference between SAX and DOM, please look it
up on Google before trolling here. Comparing SAX with DOM is comparing
apples with oranges.

chregu

Sterling,

As I said before, I know what I am doing when comparing these libraries.
If I didn't, I would not have raised this discussion.

OK, if you think I am comparing apples and oranges, suppose I am.
But take a look at what domxml finally offers: a verified tree
of nodes resulting from parsing the original XML.
Nothing more, nothing less.
I think everybody clearly understands the advantages of having such trees.
I'm actually not against it. That should be quite clear, though.

Now to the matter under discussion.
It's performance. When I set the proper callbacks for expat, I get the same
node tree 5 times faster.
What does this "good" domxml parser spend MY time on?
The answer is very simple. Have a look at parser.c shipped with libxml2.
It's what I'd call geeze. It's written from scratch, as if we were in the 19th
century.
OK Sterling, if you think this approach is fine for everyone, why not switch
back to the same parser for PHP?
Let's introduce PHP 2.0 once again :))), geeze.
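For readers unsure what building a node tree from expat callbacks involves, a rough sketch follows. It is written against Python's binding to expat (xml.parsers.expat), not the C code that was benchmarked here, and the Node class and build_tree function are hypothetical names used only for illustration.

```python
import xml.parsers.expat


class Node:
    """A deliberately small tree node (hypothetical, not libxml2's xmlNode)."""

    def __init__(self, name, attrs, parent=None):
        self.name = name
        self.attrs = attrs
        self.text = ""
        self.children = []
        self.parent = parent


def build_tree(data):
    """Build a node tree from expat's start/end/character-data callbacks."""
    root = None
    current = None

    def start(name, attrs):
        nonlocal root, current
        node = Node(name, attrs, current)
        if current is None:
            root = node            # first element becomes the root
        else:
            current.children.append(node)
        current = node             # descend into the new element

    def end(name):
        nonlocal current
        current = current.parent   # climb back up to the parent

    def chars(text):
        if current is not None:
            current.text += text   # expat may deliver text in several pieces

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)       # True marks this as the final chunk
    return root


if __name__ == "__main__":
    tree = build_tree(b"<doc><p id='1'>hi</p><p>there</p></doc>")
    print(tree.name, [child.name for child in tree.children])
```

A tree built this way carries none of the validation, namespace or entity handling that a full DOM implementation provides, which is worth keeping in mind when comparing timings.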

What I'd love to see is a Flex-based lexer for XML; Flex has proven its
really good performance.

Let other people say what they think about their performance needs.

IMHO it's too early to switch to libxml2. It's pretty slow when parsing XML.

All the best,
Dmitri.

Dmitri,

Geeze. As Christian said, you're comparing apples and oranges. To
benchmark properly, compare libxml2's push parser interface with
expat's interface. Or look on the web: there have been plenty of
benchmarks, and libxml2 is the fastest XML parsing library available
(besides msxml, which is closed source). In terms of SAX processing,
expat has a very, very slight advantage in some situations, but nothing
to speak of.
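As an illustration of what a push-style interface looks like, the sketch below feeds a document to expat in small chunks through Python's xml.parsers.expat binding. libxml2's counterpart would be its own push parser (xmlCreatePushParserCtxt and xmlParseChunk), which is not shown here. The function name parse_in_chunks and the chunk size are assumptions made for the example.

```python
import xml.parsers.expat


def parse_in_chunks(data, chunk_size=16):
    """Push `data` into expat piece by piece and collect element names."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)

    # The caller drives the parser: each call hands over one more chunk.
    for offset in range(0, len(data), chunk_size):
        parser.Parse(data[offset:offset + chunk_size], False)

    # An empty final chunk tells expat that the document is complete.
    parser.Parse(b"", True)
    return names


if __name__ == "__main__":
    print(parse_in_chunks(b"<a><b/><c>text</c></a>", chunk_size=4))
```

A like-for-like benchmark would drive both libraries this way, with equivalent callbacks registered on each side and no tree being built on either.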

-Sterling

Hi Christian,

I compared parsing, and only parsing. Does that make sense?
Certainly, I expected some overhead for memory allocation when building
the DOM tree.
I believe this overhead should be reasonable, for example less than 3-5
times.
But the actual overhead is much higher, incredibly higher.

Could you explain what this time is spent on? Why is xmlParseFile() so
slow?

Also, it would be nice to hear your opinion on why xmlFreeDoc() is slow.
It should only free allocated memory, nothing more. I expected 1-10ms
for it, while I actually got 133ms, quite comparable with the parsing time.

Also, why is xmlParseMemory() 3 times slower than xmlParseFile()? It
can't be explained easily, I guess.

I think libxml2 is a really SLOW library, purely slow, and will not
satisfy people who are concerned about performance.

-Dmitri

I haven't looked at Sterling's code yet, but you can't compare SAX parsing
with expat to DOM parsing with libxml2. libxml2 does support SAX as
well, however, and I assume (and hope) Sterling used only that for the
ext/xml replacement (building an in-memory DOM tree by default in ext/xml
would make a lot of people very unhappy ;) )

Dmitri, what exactly did you compare?

chregu

Hi,

Sterling, before doing such a weird thing as moving everybody to
libxml2, please compare the performance of what we have with expat to
what we'll get with libxml2.

I tested them both with quite a big XML file, ~500kB. Expat parsed the
doc in 19ms, while libxml2 took 267ms.
That is 14 times slower. I understand that there is a big difference
between what expat does and what libxml2 does.
On the other hand, there are some 3rd-party xmldom libraries that
parse an XML file to xmldom in ~110-130ms.
At least two times faster.

It should also be noted that libxml2 spends an INCREDIBLY long time
freeing the parsed document: 133ms.
Moreover, when I tried to parse a pre-loaded document (xmlParseMemory),
it showed even worse results: 786ms.
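The figures above cannot be reproduced without the original file and harness, but the shape of such a measurement can be sketched. The example below times three phases separately (a SAX-style pass, a DOM build, and freeing the DOM) using Python's standard library as a stand-in for the C libraries. The synthetic document, the function names and the choice of minidom are all assumptions, and the absolute numbers it prints say nothing about libxml2 or expat themselves.

```python
import time
import xml.sax
from xml.dom import minidom


def make_document(n_items):
    """Generate a synthetic XML document (about 500 kB for 15000 items)."""
    items = "".join(f"<item id='{i}'>value {i}</item>" for i in range(n_items))
    return f"<root>{items}</root>".encode("utf-8")


def timed(func, *args):
    """Run func(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, (time.perf_counter() - start) * 1000.0


def benchmark(data):
    """Time the SAX pass, the DOM build and the DOM teardown separately."""
    # SAX pass with a handler that ignores every event: parsing cost only.
    _, sax_ms = timed(xml.sax.parseString, data, xml.sax.ContentHandler())
    # DOM build: parsing plus allocating and linking one object per node.
    doc, dom_ms = timed(minidom.parseString, data)
    # DOM teardown: the analogue of freeing the parsed document.
    _, free_ms = timed(doc.unlink)
    return {"sax_parse_ms": sax_ms, "dom_parse_ms": dom_ms, "dom_free_ms": free_ms}


if __name__ == "__main__":
    for phase, ms in benchmark(make_document(15000)).items():
        print(f"{phase}: {ms:.1f} ms")
```

Reporting the phases separately, and running each several times, makes it easier to tell whether the cost lies in tokenizing, in tree construction, or in teardown.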

I believe it's too early to switch to this library. At least there are
some reasons to think more about it.

Best regards,
Dmitri.

"Sterling Hughes" sterling@bumblebury.com wrote in message
news:1051978274.11377.131.camel@hasele...

Hi,

Well, OK, I have libxml2 successfully bundled with PHP, and I've gone
further and created a C-level compatibility layer which maps
expat <-> libxml2. I've also moved the detection logic for both libxml
and expat into php5/bundle/libxml and php5/bundle/expat respectively.
This way you can choose your backend at the configure line, and things
will work transparently (by default, expat and libxml are compiled in,
and the XML extension uses expat). I've also done the "namespace
redefinition" heavy lifting - I'm not quite sure it works, but I have
renamed most (from what I can tell, all) public symbols, as was done with
expat. I'm sure this could be ironed out pretty easily if I made any
mistakes.

As far as I'm concerned, the important thing here is bundling libxml2.
I think everyone who is implementing XML support around PHP will agree
that expat just isn't meeting our needs, specifically:

a) The ability to easily access and modify XML documents from within a
programmatic structure, à la DOM (this would also make it easy for me to
implement my SimpleXML[1] extension).

b) The ability to query an XML document via XPath.

c) The ability to validate an XML document against either a DTD or an
XML Schema (very important, especially for SOAP).

d) Proper Unicode support.

e) Support for XPointer and XLink.

f) Support for Docbook and HTML parsing.

g) Expat doesn't even fully support the same capabilities that libxml2
does when it comes to SAX processing.

However, when we bundle expat with PHP, and therefore make ext/xml an
"always available" extension, we create the illusion that it is the
"recommended" and "best" solution for XML parsing with PHP, when in fact
it really isn't.

Our needs as far as XML support goes are growing. Whether it be
implementing technologies that exist on top of XML (SOAP, WSDL, RDF) or
implementing extensions that make it easier to access XML (SimpleXML, DOM),
expat makes it way too hard (for all intents and purposes, impossible) to
implement these systems.

Therefore, I'm suggesting that we bundle libxml2, while (for now)
keeping expat in as well. This will cause absolutely no backwards
compatibility changes, while at the same time, it will allow you to use
only libxml2 for XML processing (--without-bundle-expat), with 97% [2]
backwards compatibility maintained.

-Sterling

[1] http://news.php.net/article.php?group=php.xml.dev&article=6
[2] This is one of 54% of facts made up on the spot. Suffice it to say
that the new extension is "mostly" backwards compatible, and the places
where it breaks shouldn't have been relied upon anyway.

--
"Reductionists like to take things apart. The rest of us are
just trying to get it together."
- Larry Wall, Programming Perl, 3rd Edition

--
nam...christian stocker adr...pflanzschulstr. 31, ch-8004 zurich
pho...+41 43 317 9984 www...http://blog.bitflux.ch
mob...+41 76 561 8860 ema...chregu@phant.ch
wor...+41 1 240 5670 gpg...0x5CE1DECB
--
Good judgement comes from experience, and experience comes from
bad judgement.
- Fred Brooks


22 years ago by Dmitri Dmitrienko

Christian,

xmlParseFile does build a DOM-Tree out of your XML-Document, which is of
course slower than the SAX-parsing expat is doing..

It is not an obvious conclusion that any XML DOM parsing should be slower.
Moreover, it is completely wrong if you compare libxml2 vs expat.

If you think about it more, you'll see that a DOM parser only allocates
nodes and links them in lists.
Should it be so much slower? Are you sure that allocating nodes should
slow everything down by 14 times?
I believe not.

Also, I do not see any reasonable explanation for why libxml disposes of a
document so slowly.
It does not need to verify, it does not need to parse, it only frees nodes.
NOTHING MORE.
And it takes nearly the same time as allocating/parsing and verifying.

The only obvious conclusion is that all the algorithms are written
inefficiently.

I'm not against XML DOM and its benefits. I'm against wrong and inefficient
algorithms.
People, why don't you read Donald Knuth's books?

But libxml2 can parse your XML document in SAX-style only without
building an DOM-Tree. Without looking at Sterling's code, I assume,
that's what he did and you should compare this code to expat and not
xmlParseFile..

libxml2-2.5.7, parser.c:10670

    xmlDocPtr
    xmlParseDoc(xmlChar *cur) {
        return(xmlSAXParseDoc(NULL, cur, 0));
    }

I think the code excerpt shown above is a good answer to your arguments.

libxml2 is certainly not slow, it has a well known reputation as being
very fast for what it does.

I would not discuss reputation. It's a completely different thing.
As for performance, I still insist that libxml2 a) has a pretty slow parser
due to inefficient algorithms and b) is pretty slow in some other respects,
including freeing documents, once again due to inefficient algorithms.

-Dmitri.