2011-06-19

Parsing and serializing with SAX

Several years ago, I started working on the XML parser that Gilbert Baumann had written as part of his web browser called "Closure". Since then, "Closure XML" (or cxml for short) has developed into a set of little libraries, one main goal being completeness and correctness with regard to the various standards they are following.

And standards abound in XML land, which is nice for implementors (thanks to the good test suites!) and nice for users (because the specs partially serve as documentation, and make it easy to transition between different languages implementing them). But I've always tried to release cxml with enough documentation to get users started for all the parts that are implementation-specific. And not all areas are covered by standards: Of course, the document format itself is specified strictly; the same goes for XPath, XSLT, schemas, etc.

But little is standardized in terms of API support, and that sort of choice is generally good: After all, a Lisp XML parser should fit into the Lisp world and not mimick (say) JavaScript too much. But many good ideas can be borrowed from other languages. Examples inspired by Java are STP, motivated heavily by XOM (and tweaked for added lispiness) -- and SAX:

SAX is a classic Java API. It defines a protocol of methods that get called by an XML parser, and each method call signifies an event (e.g. that the parser saw a XML tag). In cxml, SAX is one of two fundamental APIs offered (the other being a StAX-like pull-based interface), and it's essential to its inner workings. Yet I had never bothered to document it fully. For one thing, everyone seemed to know SAX from Java anyway. It's also hidden from view for most users. And ultimately, it's just a list of generic functions, right?

Technically it is just that, and yet it's central to communication between cxml's libraries, and it makes parsing and serialization in cxml modular and reusable. Hence some users had long suggested to me that I should explain SAX in full.

So here it is: The SAX overview.

TL;DR: Skip to the link above.

2011-06-18

SBCL on github and gitorious

SBCL moved to git last week, and with the main repository hosted on sf.net, several users have asked for an SBCL mirror on github.

Automatically synced mirrors are available on github and gitorious (where SBCL branches had, of course, also been uploaded by contributors long before the official switch to git), and it will now be easier than ever to make your own repository and branches.

The mirrors update several times a day:

Please test, and good luck with your SBCL hacking efforts!

2010-04-07

More looping

For context, see Nathan's blog. [Edit: I've just learned on IRC from Maciej Katafiasz that iterate also has a feature built in which renders this hack unnecessary. Even more iterate awesomeness! I'll leave this post up though, because I still think it' s pretty cute that SBCL optimizes this case so nicely.]
MY-USER> (defun outer-collect-in-inner-loop ()
           (declare (optimize speed (safety 0) (debug 0)))
           (iter (for outer from 0 below 10)
                 (declare (type (integer -1 10) outer))
                 (flet ((outer-collect (x) (collect x)))
                   (iter (for inner from 0 below 10)
                         (declare (type (integer -1 10) inner))
                         (outer-collect (+ (* 10 outer) inner))))))
OUTER-COLLECT-IN-INNER-LOOP
Note that the COLLECT in the outer loop is ITERATE's collecting feature. The trick is to use it in a FLET before starting the inner loop, thereby making it available further down. Declarations added only to get a nicer disassembly.
MY-USER> (OUTER-COLLECT-IN-INNER-LOOP)
(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99)
Okay, it works. But how is it being compiled?
MY-USER> (disassemble 'OUTER-COLLECT-IN-INNER-LOOP)
; disassembly for OUTER-COLLECT-IN-INNER-LOOP
; 069222FF:       31C9             XOR ECX, ECX               ; no-arg-parsing entry point
;      301:       BE17001020       MOV ESI, 537919511
;      306:       BF17001020       MOV EDI, 537919511
;      30B:       BA17001020       MOV EDX, 537919511
;      310:       48C7C1F8FFFFFF   MOV RCX, -8
;      317:       E977000000       JMP L5
;      31C:       90               NOP
;      31D:       90               NOP
;      31E:       90               NOP
;      31F:       90               NOP
;      320: L0:   31DB             XOR EBX, EBX
;      322:       48C7C3F8FFFFFF   MOV RBX, -8
;      329:       EB5B             JMP L4
;      32B:       90               NOP
;      32C:       90               NOP
;      32D:       90               NOP
;      32E:       90               NOP
;      32F:       90               NOP
;      330: L1:   488BD1           MOV RDX, RCX
;      333:       486BD20A         IMUL RDX, RDX, 10
;      337:       4C8BCB           MOV R9, RBX
;      33A:       4E8D040A         LEA R8, [RDX+R9]
;      33E:       4989AC24B8000000 MOV [R12+184], RBP
;      346:       4D8B5C2468       MOV R11, [R12+104]
;      34B:       498D5310         LEA RDX, [R11+16]
;      34F:       4939542470       CMP [R12+112], RDX
;      354:       765C             JBE L7
;      356:       4989542468       MOV [R12+104], RDX
;      35B:       498D5307         LEA RDX, [R11+7]
;      35F: L2:   4931AC24B8000000 XOR [R12+184], RBP
;      367:       7402             JEQ L3
;      369:       CC09             BREAK 9                    ; pending interrupt trap
;      36B: L3:   4C8942F9         MOV [RDX-7], R8
;      36F:       48C7420117001020 MOV QWORD PTR [RDX+1], 537919511
;      377:       4881FE17001020   CMP RSI, 537919511
;      37E:       7529             JNE L6
;      380:       488BF2           MOV RSI, RDX
;      383:       488BFA           MOV RDI, RDX
;      386: L4:   4883C308         ADD RBX, 8
;      38A:       488BD3           MOV RDX, RBX
;      38D:       4883FA50         CMP RDX, 80
;      391:       7C9D             JL L1
;      393: L5:   4883C108         ADD RCX, 8
;      397:       488BD1           MOV RDX, RCX
;      39A:       4883FA50         CMP RDX, 80
;      39E:       7C80             JL L0
;      3A0:       488BD6           MOV RDX, RSI
;      3A3:       488BE5           MOV RSP, RBP
;      3A6:       F8               CLC
;      3A7:       5D               POP RBP
;      3A8:       C3               RET
;      3A9: L6:   48895701         MOV [RDI+1], RDX
;      3AD:       488BFA           MOV RDI, RDX
;      3B0:       EBD4             JMP L4
;      3B2: L7:   6A10             PUSH 16
;      3B4:       4C8D1C25601B4200 LEA R11, [#x421B60]        ; alloc_tramp
;      3BC:       41FFD3           CALL R11
;      3BF:       5A               POP RDX
;      3C0:       488D5207         LEA RDX, [RDX+7]
;      3C4:       EB99             JMP L2

2009-08-12

cl-perec blog series by pinterface

For several years, the best (and only) option for portable access to relational databases from Common Lisp has been the CLSQL library, a free reimplementation of LispWorks' CommonSQL interface with support for many backend databases, a lispy SQL abstraction as well as simple object-relational mapping support.

Recently, several alternatives have emerged: Postmodern provides particularly good support tailored to PostgreSQL only, other libraries abstract away from relational databases entirely, and finally, there are two new stars on the library horizon: CL-RDBMS and CL-PEREC.

Having used CLSQL for several years, I am currently investigating a switch to CL-RDBMS and PEREC for a code base of non-trivial size, aiming to improve scalability, portability across databases, and last not least readability of the code base.

And I am very happy with everything I have found so far:

In the CL-RDBMS/PEREC ecosystem, work is split into two layers.

  • CL-RDBMS forms the lower layer. It abstracts over database access libraries (currently with support for PostgreSQL, Oracle, and SQLite) and defines an optional, Lispy, extensible syntax for SQL for data definition and query statements.
  • The optional PEREC layer is the one which really shines. It is an object-relational mapping (ORM) seamlessly integrating CLOS classes with database tables.

Highlights:

  • Sensible caching model: Configurable caching of slots within a transaction, but not between transactions.
  • Object identity preserved (within a transaction).
  • Protection against accidental cross-transaction access.
  • Lispy SELECT statements using code that looks like ordinary CLOS use, but which usually compiles down to efficient database-side SQL statements.

Overall, an architecture that is ideal for robust, scalable setups with multiple threads and processes without risk of the "oversold concert tickets" situation reported for a popular Lisp web site based on older ORM software a few years ago...

The only fly in the ointment for me was the lack of introductory material. PEREC comes with an extensive test suite and nice examples, but I would have wished for a few more hints to get me started.

Thanks to fellow blogger pinterface, my wish has been granted. He has written a blog series about PEREC with many details about getting set up, and including various customization tips.

Articles in the series:

  1. Getting Started with cl-perec
  2. Persisting Simple Types with cl-perec
  3. Sensible Serializing with cl-perec
  4. Peering Down the Rabbit Hole with cl-perec

PEREC is still evolving, but the tips presented are definitely helpful for PEREC as implemented today. Also check the comments on the blog articles, written by PEREC's author!

2009-04-14

Ben Hyde's project dependency graph

Ben Hyde writes about clbuild and has a very nice graph of project dependencies in his post.

His approach is extraordinarily simple and uses a small Lisp program to parse clbuild's dependencies file and write out a .dot file. Of course, using Graphviz is a well-known approach for this kind of dependency diagram. Bill Clementson mentioned Richard Newman's graphs a few years ago.

But Ben's graph is a fresh take on this because of the way it uses clbuild's recorded dependency information, which shows dependencies of projects rather than dependencies of systems. What's the difference between projects and systems? Lisp projects often contain multiple .asd files.

Before thinking about the details of that distinction, let's look at the picture. Here's his graph, showing Hunchentoot, Drakma, cxml, and others:

Observations:
  • Yes, there's a lot of Babel in there. (It deserves it!)
  • The project dependency graph is not transitive. Note how CXML depends on Babel, which in turn needs Stefil. But CXML itself does not depend on Stefil. How can that be? It's because the system CXML depends only on the system babel. And that's not where the use of Stefil comes from. It's actually just babel-tests that needs Stefil, and that isn't needed when you're compiling CXML.
  • Where there are indirect dependencies, the graph shows them in full. Note that CL+SSL has an arrow to babel here, although it doesn't use it directly. This indirect dependency is due to its use of CFFI.

While probably not suitable for all dependency graphs, I think that this explicit display of indirect dependencies gives very nice results in this case, because it highlights commonly used libraries like Babel more than an ordinary graph would have done.

You can download his code at github. Note that you don't even have to download the projects that you want to compute a graph for, if you replace the call to directory with any other test of your choice.

2009-04-04

XSLT 1.0 implemented in Common Lisp

Newly released: Xuriella XSLT, the first implementation of XSLT written entirely in Common Lisp.

Based on Plexippus XPath, Xuriella is an implementation of XSLT 1.0, written by Ivan Shvedunov and me.

Xuriella is quite complete and correct -- we run the official testsuite, with more than 95% of tests currently passing.

Extensions

One advantage of a pure-Lisp implementation is that extension elements (as well as XPath extensions) can be defined easily.

That's a huge plus because XSLT itself is a very specialized programming language -- it excels at XML/HTML generation and transformation only. Being able to write custom extensions in Lisp helps with any non-XML-ish parts of the task which XSLT itself might not handle conveniently.

Documentation

If you just want to try applying stylesheets, there are only two functions you need to know about: parse-stylesheet and apply-stylesheet.

For details about these functions (and all others, including those for extensions), refer to the API documentation.

Example

The example uses Hunchentoot and Xuriella with XSLT as a template language in a simple directory listing request.

2008-12-07

Plexippus XPath released

The first version of Ivan Shvedunov's Plexippus XPath has been released, an implementation of XPath 1.0, written entirely in Common Lisp.

Plexippus has been available from darcs for a few months now, and we have had some early users (thanks for testing the code at that early stage), but this is the first tarball release. New tarballs for cxml and its other add-on projects are also available now, so that you don't have to juggle with git, darcs, and CVS anymore.

Beta version: SBCL-only

Plexippus is well-tested on SBCL, but has not yet been ported to any other Lisp implementation. That's why this release is declared as a beta version.

(Ports to other Lisps mainly involve support code for non-trapping floating-point arithmetic. Patches are welcome.)

The XPath protocol

For extensibility, Plexippus defines a protocol of generic functions to implement the XPath data model, so any object model for XML can be queried using Plexippus if it implements these generic functions. Out-of-the-box, there is support for cxml-STP (a DOM alternative) and cxml's DOM.

Some examples

Let's parse a test document first (no XPath involved yet):

CL-USER> (defparameter *document*
    (cxml:parse "<test a='1' b='2'>
                          <child>hello world</child>
                        </test>"
         (stp:make-builder)))
*DOCUMENT*

Find and add the two attributes as numbers:

CL-USER> (xpath:evaluate "test/@a + test/@b" *document*)
3.0d0

Find the element called child and its string value. (Using string() in the expression itself would also have worked.)

CL-USER> (xpath:string-value (xpath:evaluate "//child" *document*))
"hello world"

Iterate over all elements, in document order:

CL-USER> (xpath:do-node-set (node (xpath:evaluate "//*" *document*))
    (format t "found element: ~A~%"
     (xpath-protocol:local-name node)))
found element: test
found element: child

More examples here.