While working on this project, I noticed that some of the BNFs were duplicated, each copy with different requirements, and this bothered me. I felt the duplication was required by production 80,
EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
The key in this production is that EncName names the code set for the input, implying that you may have to change your transcoding mechanism. It seemed the most forthright approach was to use the parser to pull off the encoding name, so I duplicated the relevant BNFs and used a separately scoped recursive-descent parser to properly recover the encoding name.
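To make production 80 concrete, here is a minimal recursive-descent sketch of it, written in Python rather than Mythryl. It is not the project's code: every function name is hypothetical, and the helper productions (S, Eq, EncName) are transcribed from the XML 1.0 grammar. Each function takes the input text and an offset and returns the new offset, which is the usual shape for a hand-written recursive-descent parser.

```python
import string


class ParseError(Exception):
    """Raised when the input does not match the production being parsed."""


def parse_s(text, pos, required=True):
    # S ::= (#x20 | #x9 | #xD | #xA)+
    end = pos
    while end < len(text) and text[end] in " \t\r\n":
        end += 1
    if required and end == pos:
        raise ParseError(f"whitespace expected at offset {pos}")
    return end


def parse_literal(text, pos, literal):
    if not text.startswith(literal, pos):
        raise ParseError(f"{literal!r} expected at offset {pos}")
    return pos + len(literal)


def parse_eq(text, pos):
    # Eq ::= S? '=' S?
    pos = parse_s(text, pos, required=False)
    pos = parse_literal(text, pos, "=")
    return parse_s(text, pos, required=False)


def parse_enc_name(text, pos):
    # EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
    if pos >= len(text) or text[pos] not in string.ascii_letters:
        raise ParseError(f"encoding name expected at offset {pos}")
    tail = string.ascii_letters + string.digits + "._-"
    end = pos + 1
    while end < len(text) and text[end] in tail:
        end += 1
    return text[pos:end], end


def parse_encoding_decl(text, pos=0):
    # EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
    pos = parse_s(text, pos)
    pos = parse_literal(text, pos, "encoding")
    pos = parse_eq(text, pos)
    if pos >= len(text) or text[pos] not in "\"'":
        raise ParseError(f"quote expected at offset {pos}")
    quote = text[pos]
    name, pos = parse_enc_name(text, pos + 1)
    pos = parse_literal(text, pos, quote)  # closing quote must match the opening one
    return name, pos
```

Calling parse_encoding_decl(' encoding="UTF-8"') returns the name together with the offset just past the closing quote, so a caller can switch transcoders and then resume parsing from that offset.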
Then I ran into production 28:
doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
which looks rather harmless, until you read this:
Well-formedness constraint: External Subset
The external subset, if any, MUST match the production for extSubset.
I looked at that, and then traced out extSubset a little bit. I could see it would require a different return value than the general case. Here’s extSubset:
extSubset ::= TextDecl? extSubsetDecl
and if you trace it out, you see a lot of use of other, common, productions. Common implies I’d have to duplicate them for the new functionality. That irritated me – inelegant AND (to quote my colleague Chris Johnson) “copies aren’t”.
So I have worked out a refactoring. The single package, sax, is now two: sax and sax_foundation. sax_foundation is a generic package, which is to say it is parameterized: along with the handlers [noted below], it also accepts a specification of a return value. sax now manages the parsing effort. sax_foundation contains the entire BNF and provides parsing entry points for the encoding, the general case (which I think could be conflated with the encoding case), and (anticipated) extSubset. When sax needs an encoding from the document, it uses an instantiation of sax_foundation that returns a String containing the name; when it needs to execute extSubset, it’ll use an instantiation of sax_foundation that returns the Dtd structure; and in the general case, the return value will be a structure containing the handlers (content, lexical, error, and entity, so far), which is the structure sax itself will return to callers.
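The shape of that refactoring can be sketched as follows. This is a Python stand-in, not the Mythryl packages: SaxFoundation, ParseState, and make_result are hypothetical names, and two regular expressions stand in for the shared BNF. The point is only that one body of parsing code is instantiated several times, each time with a different specification of what to return.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class ParseState:
    """Everything the shared grammar code has learned about the input."""
    encoding: Optional[str] = None
    start_tags: List[str] = field(default_factory=list)


class SaxFoundation(Generic[T]):
    """Holds the single copy of the grammar; parameterized by the return value."""

    def __init__(self, make_result: Callable[[ParseState], T]) -> None:
        self._make_result = make_result

    def _scan(self, text: str, stop_after_encoding: bool) -> ParseState:
        # Toy stand-in for the real productions (assumption: regexes, not BNF).
        state = ParseState()
        match = re.search(
            r"""encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1""", text
        )
        if match:
            state.encoding = match.group(2)
        if not stop_after_encoding:
            state.start_tags = re.findall(r"<([A-Za-z_][\w.-]*)", text)
        return state

    def parse_encoding(self, text: str) -> T:
        """Entry point used when only the encoding name is wanted."""
        return self._make_result(self._scan(text, stop_after_encoding=True))

    def parse_document(self, text: str) -> T:
        """Entry point for the general case."""
        return self._make_result(self._scan(text, stop_after_encoding=False))


# Two instantiations over the same grammar code, differing only in result type.
encoding_parser = SaxFoundation(lambda state: state.encoding)
general_parser = SaxFoundation(lambda state: state)

DOC = '<?xml version="1.0" encoding="ISO-8859-1"?><a><b/></a>'
encoding_name = encoding_parser.parse_encoding(DOC)
document_state = general_parser.parse_document(DOC)
```

A third instantiation returning a Dtd would follow the same pattern, with its own entry point for extSubset.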
I am currently running regression tests to make sure the new structure works at least as well as the old one. One batch of problems resulted from splitting sax in two: the exceptions moved from sax into the inner, hidden sax_foundation, which is not appropriate, so I split the exceptions into an independent package that everyone can access. The other batch of problems has come from inadvertently losing behavior that was implemented by the now-eliminated duplicate BNFs for EncodingDecl; there have been only a couple of these. Once these are cleaned up, regression should be clean. Then I can implement the link to extSubset. This will be done using a callback function. sax will supply the callback function, and when it’s called (by production 28) it will instantiate a new sax_foundation customized for returning a new Dtd (since that’s what extSubset is all about, as I understand it) and process the external subset with that instantiation, using a yet-to-be-defined entry point.
The callback function is, essentially, a hack, but not a bad one, since it removed a great deal of duplicated code.
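A rough sketch of the callback arrangement, again in Python with hypothetical names (Dtd, parse_ext_subset, parse_doctypedecl, on_ext_subset) and a deliberately simplified grammar: only the SYSTEM form of ExternalID is recognized, the internal subset is ignored, and the external subset is looked up in an in-memory dictionary instead of being fetched. The doctypedecl code knows nothing about how a Dtd is built; it just hands the system identifier to whatever callback the caller supplied.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Dtd:
    element_names: List[str] = field(default_factory=list)


# The callback receives the system identifier and returns the parsed Dtd.
ExtSubsetCallback = Callable[[str], Dtd]


def parse_ext_subset(text: str) -> Dtd:
    """Stand-in for an extSubset entry point: collect declared element names."""
    return Dtd(re.findall(r"<!ELEMENT\s+([^\s>]+)", text))


def parse_doctypedecl(text: str, on_ext_subset: ExtSubsetCallback) -> Optional[Dtd]:
    """Simplified production 28: invoke the callback when an ExternalID is present."""
    match = re.match(r"""<!DOCTYPE\s+\S+\s+SYSTEM\s+(["'])(.*?)\1\s*>""", text)
    if match is None:
        return None  # no external subset declared (or a form this sketch ignores)
    return on_ext_subset(match.group(2))


# Caller side: the callback decides how to obtain and parse the external subset.
RESOURCES = {
    "note.dtd": "<!ELEMENT note (to, body)>\n<!ELEMENT to (#PCDATA)>",
}


def load_dtd(system_id: str) -> Dtd:
    return parse_ext_subset(RESOURCES[system_id])


dtd = parse_doctypedecl('<!DOCTYPE note SYSTEM "note.dtd">', load_dtd)
```

Because the grammar code only sees a function value, the outer package can create a fresh Dtd-returning parser inside the callback without the inner package depending on it.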
On my TODO list: find (or, reluctantly, construct) a set of XML pages which may be considered to be an official test suite for XML parsers, including expected responses for a SAX-style parser. Share the knowledge if you happen to know of such a suite!
Also, finding a few huge XML pages will be necessary as I’m interested to see if an internal hack I constructed for scalability has also had a positive effect on performance. Unfortunately, I don’t have a way to construct a baseline, so this won’t be a professional-grade assessment, but just a “gee-whiz, that went fast”, or not. This may also require waiting for the release of a new version of Mythryl, as the current version seems to have a problem with handling file I/O in the manner I need. (Or perhaps I just haven’t figured it out.)
Sadly, there are concerns for the future of Mythryl. Earlier this year, the primary (and only) compiler developer, Cynbe ru Taren, was afflicted with colorectal cancer. Today I received news that while the removal of the cancer went well, he now has lung cancer. He continues to work on the project, but I do not believe he has any fellow developers, so if he goes down, the Mythryl project will probably die with him, which would be a shame. It has some great potential, and brings into sharp relief the excessive dangers of C & C++; the possibility of using formal methods for verifying code, impossible (or extremely difficult) in other languages, is quite alluring. If you think you’re a compiler hacker with some sharp skills, you might want to consider getting involved. Cynbe has done the majority of the lifting (which is to say, translating an academic project into a production-grade compiler), but I’m sure it could still use a lot of work.