2011-06-19

Parsing and serializing with SAX

Several years ago, I started working on the XML parser that Gilbert Baumann had written as part of his web browser called "Closure". Since then, "Closure XML" (or cxml for short) has developed into a set of little libraries, one main goal being completeness and correctness with regard to the various standards they are following.

And standards abound in XML land, which is nice for implementors (thanks to the good test suites!) and nice for users (because the specs partially serve as documentation, and make it easy to transition between different languages implementing them). But I've always tried to release cxml with enough documentation to get users started for all the parts that are implementation-specific, because not all areas are covered by standards. The document format itself is specified strictly, of course; the same goes for XPath, XSLT, schemas, etc.

But little is standardized in terms of API support, and that sort of freedom is generally good: After all, a Lisp XML parser should fit into the Lisp world and not mimic (say) JavaScript too much. But many good ideas can be borrowed from other languages. Examples inspired by Java are STP, motivated heavily by XOM (and tweaked for added lispiness) -- and SAX:

SAX is a classic Java API. It defines a protocol of methods that get called by an XML parser, and each method call signifies an event (e.g. that the parser saw an XML tag). In cxml, SAX is one of two fundamental APIs offered (the other being a StAX-like pull-based interface), and it's essential to its inner workings. Yet I had never bothered to document it fully. For one thing, everyone seemed to know SAX from Java anyway. It's also hidden from view for most users. And ultimately, it's just a list of generic functions, right?
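To make the event protocol concrete, here is a minimal sketch using Python's standard `xml.sax` module rather than cxml itself. The class `PathRecorder` and the function `record_paths` are names made up for this illustration. The handler keeps a stack of open elements, so each piece of text can be matched to the element path it appeared under.

```python
import xml.sax


class PathRecorder(xml.sax.ContentHandler):
    """Hypothetical handler: remembers which element each text node belongs to."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.texts = []   # (path, text) pairs collected so far

    def startElement(self, name, attrs):
        # Event: the parser saw a start tag.
        self.stack.append(name)

    def endElement(self, name):
        # Event: the parser saw the matching end tag.
        self.stack.pop()

    def characters(self, content):
        # Event: character data. A parser may deliver long text in
        # several chunks; this sketch does not try to merge them.
        text = content.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def record_paths(xml_text):
    """Parse a string and return the (path, text) pairs seen."""
    handler = PathRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.texts
```

Calling `record_paths("<doc><title>Hi</title></doc>")` drives the handler through one start-element, characters, and end-element event per tag pair. The parser owns the control flow, and the handler object is where any state lives.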

Technically it is just that, and yet it's central to communication between cxml's libraries, and it makes parsing and serialization in cxml modular and reusable. Hence some users had long suggested to me that I should explain SAX in full.
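The modularity point can be sketched the same way, again with Python's standard library standing in for cxml. Because a serializer is itself just an event handler, a small filter can sit between parser and serializer and forward events downstream. `UppercaseText` and `shout` are hypothetical names for this example; `XMLGenerator` is the serializer from `xml.sax.saxutils`.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class UppercaseText(xml.sax.ContentHandler):
    """Hypothetical filter: forwards every event, upper-casing text on the way."""

    def __init__(self, downstream):
        super().__init__()
        self.downstream = downstream  # any other SAX content handler

    def startDocument(self):
        self.downstream.startDocument()

    def endDocument(self):
        self.downstream.endDocument()

    def startElement(self, name, attrs):
        self.downstream.startElement(name, attrs)

    def endElement(self, name):
        self.downstream.endElement(name)

    def characters(self, content):
        self.downstream.characters(content.upper())


def shout(xml_text):
    """Parse xml_text, filter it, and serialize the result to a string."""
    out = io.StringIO()
    serializer = XMLGenerator(out, encoding="utf-8")
    xml.sax.parseString(xml_text.encode("utf-8"), UppercaseText(serializer))
    return out.getvalue()
```

The parser, the filter, and the serializer know nothing about each other beyond the shared set of event methods, which is what lets them be recombined freely.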

So here it is: The SAX overview.

TL;DR: Skip to the link above.

1 comment:

Anonymous said...

Did not care for the overview too much. I'm certainly not sure why just slamming together all the text would be something I'd ever want to do, for example. Sure it shows usage, I guess, but provides no pointers to things I might actually want to do with XML, as opposed to cxml. Such as how to maintain state information during the parse, so that I can know what goes with what.