2007-03-03

Klacks parsing

Closure XML has been based on a SAX-like API for several years now (in addition to the DOM implementation on top of that). But although the pervasive use of SAX within CXML itself has been a success story, most users seem to prefer DOM usage over SAX handler hacks. Anyone who has ever parsed a non-trivial schema using SAX knows why: Maintaining separate start-element and end-element methods is very inconvenient. Code ends up dispatching on tag names using huge case forms while doing all bookkeeping manually in slots of the handler instance.

Starting with the current release of CXML, there is now a new parser interface called Klacks.

Similar to StAX, the new interface is more convenient than SAX, while still providing the same features as the old one, including validation.

Basically, the klacks parser can be used as a (rather sophisticated) tokenizer, and you get to write a recursive descent parser based on that.
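As a rough sketch of that idea, here is a tiny recursive descent parser that turns a document into nested lists of the form (name child...). The names parse-element and parse-document are made up for this illustration and are not part of CXML. The sketch assumes CXML is loaded and uses only cxml:make-source, klacks:peek, klacks:consume, and klacks:with-open-source.

```lisp
;; Sketch only. PARSE-ELEMENT and PARSE-DOCUMENT are hypothetical helpers,
;; not part of CXML. Assumes CXML is loaded, e.g. (ql:quickload "cxml").

(defun parse-element (source)
  "Parse one element into (LOCAL-NAME CHILD...).
Expects the current event of SOURCE to be :START-ELEMENT."
  (let ((name (nth-value 2 (klacks:consume source))) ; third value: local name
        (children '()))
    (loop
      (case (klacks:peek source)
        (:start-element                 ; nested element: recurse
         (push (parse-element source) children))
        (:characters                    ; text: second value is the string
         (push (nth-value 1 (klacks:consume source)) children))
        (:end-element                   ; our own end tag: finished
         (klacks:consume source)
         (return (cons name (nreverse children))))
        ((nil)
         (error "Unexpected end of document inside <~A>" name))
        (t                              ; comments, PIs etc. are skipped
         (klacks:consume source))))))

(defun parse-document (xml-string)
  "Parse XML-STRING and return the nested-list form of its root element."
  (klacks:with-open-source (source (cxml:make-source xml-string))
    ;; Skip :START-DOCUMENT and anything else before the root element.
    (loop until (eq (klacks:peek source) :start-element)
          do (klacks:consume source))
    (parse-element source)))
```

Each grammar rule becomes an ordinary function, and the bookkeeping lives in local variables rather than in handler slots. Note that a parser is free to deliver text in several :characters events, so a real application would probably want to concatenate adjacent strings.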

SAX and StAX are Java's protocols for XML parsing. They are sometimes described as low-level interfaces for "expert" use only (the suggested alternative being something like DOM), but their real purpose is to parse XML without building an in-memory representation.

Low-level or not, they are the right choice when parsing into application-defined data structures or when performing simple on-the-fly transformation of XML data as it is being read.

In SAX, an XML parser will process the entire document in one go, emitting events as it sees them. User code needs to implement its own handler class, with methods for the events it cares about. The SAX concept is known as "push-based".
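To make the push model concrete, here is a minimal sketch of a CXML SAX handler that counts elements. The class element-counter and its slot are invented for this example. It assumes a CXML version that provides sax:default-handler and a cxml:parse function accepting a string; older releases may need one of the more specific parse functions instead.

```lisp
;; Sketch only. ELEMENT-COUNTER is a hypothetical class, not part of CXML.
;; Assumes CXML is loaded, e.g. (ql:quickload "cxml").

(defclass element-counter (sax:default-handler)
  ((total :initform 0 :accessor element-total)))

;; The parser calls this method once per start tag; we never ask for events.
(defmethod sax:start-element
    ((handler element-counter) namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name qname attributes))
  (incf (element-total handler)))

;; The value returned from END-DOCUMENT becomes the result of the parse.
(defmethod sax:end-document ((handler element-counter))
  (element-total handler))

(defun count-elements-push (xml-string)
  "Count the elements in XML-STRING using a SAX handler."
  (cxml:parse xml-string (make-instance 'element-counter)))
```

All state has to live in the handler instance, because control stays with the parser until the document ends.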

In contrast, the "pull-based" StAX parsing model is similar to working with an input stream. User code starts by creating an input stream object for the XML document, then reads events from that stream one by one. (Klacks uses the term source instead of stream, to avoid confusion with Common Lisp streams.)
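A pull-based element counter might look like the following sketch. The function name count-elements-pull is made up for this example; the code assumes CXML is loaded and relies on klacks:peek-next returning NIL once the source is exhausted.

```lisp
;; Sketch only. COUNT-ELEMENTS-PULL is a hypothetical helper, not part of
;; CXML. Assumes CXML is loaded, e.g. (ql:quickload "cxml").

(defun count-elements-pull (xml-string)
  "Count the elements in XML-STRING by pulling events from a klacks source."
  (klacks:with-open-source (source (cxml:make-source xml-string))
    ;; User code drives the loop: each PEEK-NEXT call asks for one more event.
    (loop for key = (klacks:peek-next source)
          while key
          count (eq key :start-element))))
```

Here the counter is just a loop variable, and the caller could stop reading at any point.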

API design choices

StAX distinguishes between a high-level API, which creates a Java object for each event, and a low-level API, which just returns an enum indicating the type of event and has separate methods to access the current event's data.

Klacks has just one set of functions for both purposes, since it seemed more lispy to use multiple values. Instead of returning just a keyword indicating the event type, the main klacks functions always include useful event data as additional return values.
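The multiple-values convention can be seen by collecting everything each call returns. The helper list-events below is hypothetical and only meant as a sketch; it assumes CXML is loaded.

```lisp
;; Sketch only. LIST-EVENTS is a hypothetical helper, not part of CXML.
;; Assumes CXML is loaded, e.g. (ql:quickload "cxml").

(defun list-events (xml-string)
  "Return one list per event: the event keyword followed by its extra values."
  (klacks:with-open-source (source (cxml:make-source xml-string))
    (loop for event = (multiple-value-list (klacks:peek-next source))
          while (first event)           ; the keyword is NIL at end of input
          collect event)))
```

Code that only needs the event type can ignore the extra values, while code that wants the data can bind them with multiple-value-bind. For a :start-element event, the extra values begin with the namespace URI, local name, and qualified name.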

Java's StAX also includes classes for XML serialization. No such extension was needed for CXML, since it already supports convenient serialization using SAX events. The with-element macro and related functions make generation of those events easy.
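For illustration, serialization with with-element might look like the sketch below. The function name serialize-example is invented here; the code assumes CXML is loaded and uses cxml:with-xml-output, cxml:with-element, cxml:attribute, cxml:text, and cxml:make-string-sink.

```lisp
;; Sketch only. SERIALIZE-EXAMPLE is a hypothetical helper, not part of CXML.
;; Assumes CXML is loaded, e.g. (ql:quickload "cxml").

(defun serialize-example ()
  "Return a small XML document as a string, generated through SAX events."
  (cxml:with-xml-output (cxml:make-string-sink)
    (cxml:with-element "example"
      (cxml:attribute "id" "1")         ; attributes come before any content
      (cxml:text "text"))))
```

The macros send the corresponding SAX events to the sink, which takes care of escaping and of the XML declaration.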

Simple klacks parsing example:
* (defparameter *source* (cxml:make-source "<example>text</example>"))
* (klacks:peek-next *source*)
:START-DOCUMENT
* (klacks:peek-next *source*)
:START-ELEMENT
NIL                      ;namespace URI
"example"                ;local name
"example"                ;qualified name
* ...

4 comments:

Paulo said...

I've tested performance of CXML SAX and Klacks with no processing whatsoever (i.e. parse with default-handler and a peek-next loop until nil), and Klacks is substantially slower than SAX. In fact, it's even slower than DOM.

I believe this is so because of the lispy way of doing it. I believe the same problem occurs with StAX's event-based API, which creates objects for everything.

So the enum vs. object distinction seems to matter for performance. Or I may be wrong and Klacks is simply not optimized yet. Which one is it?

PS: I've seen SAX code in a Klacks source file, apparently for some DTD processing. Is this relevant?

David Lichteblau said...

Hi Paulo,

klacks is definitely unoptimized currently. For me, its major feature was the pull-based approach, which is essential for some stream-based XML formats like XMPP. Obviously, I'd be glad to accept patches improving speed, but currently I am not working on klacks optimization issues myself.

As far as the "lispy" API is concerned (use of multiple values is a part of that), this is not where I would start looking for speed issues. Perhaps it would be possible to add more API functions that go for raw speed, but the bulk of the problems are probably not in the API.

For one thing, klacks shares nearly all code with the SAX parser currently. It even sends SAX events for everything (not just the DTD) which then get discarded. So due to that implementation trick, klacks is automatically slower than SAX.

The creation of objects is an issue separate from the use of multiple values. What goes on here is that the values are currently consed into a list by the source. So there are actual "objects" in the current implementation, you just don't see them. This could contribute to speed issues.

But again, most of these issues should be fixable with a bit of refactoring. I just don't have the time for that right now.

Paulo said...

Thanks for the info!

BTW, after re-reading my comment, the impression I left was that I didn't like CXML. On the contrary, I like it better than all the Common Lisp libraries I've found, especially for the DOM support and DOM speed.

dontcare said...

you might also want to look at vtd-xml, the latest and most advanced XML processing API available today

http://vtd-xml.sf.net