Current Project, Ctd

Continuing my faintly ridiculous hobby, I thought I’d add a note on XML validity constraints and how they interact with the EBNF of XML: which is to say, they don’t.

Which is reasonable, of course: the point of the EBNF is to specify the syntax, not the semantics.  But this does make validity checking a little more difficult.  Here’s the first example of a constraint:

Validity constraint: Root Element Type

The Name in the document type declaration MUST match the element type of the root element.

And an illustration:

<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello, world!</greeting> 
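
Outside of any particular parser, the constraint itself is small enough to sketch in a few lines of Python using the standard library's xml.dom.minidom. This is only an illustration of the rule, not part of the SAX parser discussed here; the function name root_matches_doctype and the two sample documents are made up for the example, and treating a document with no DOCTYPE as passing is an assumption.

```python
from xml.dom import minidom


def root_matches_doctype(xml_text: str) -> bool:
    """Return True when the DOCTYPE name equals the root element's type.

    Assumption: a document with no DOCTYPE passes, since there is
    nothing to compare the root element against.
    """
    doc = minidom.parseString(xml_text)
    if doc.doctype is None:
        return True
    return doc.doctype.name == doc.documentElement.tagName


# Hypothetical sample documents; the external DTD is never fetched.
GOOD = '<!DOCTYPE greeting SYSTEM "hello.dtd"><greeting>Hello, world!</greeting>'
BAD = '<!DOCTYPE greeting SYSTEM "hello.dtd"><farewell>Goodbye!</farewell>'
```

Calling root_matches_doctype(GOOD) should succeed and root_matches_doctype(BAD) should fail, because only the first document's root element type agrees with the name in its DOCTYPE.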

Note that the name “greeting” appears both in the !DOCTYPE declaration and as the type of the root element. So what does Production 1 look like?

document ::= prolog element Misc*

element is the important piece here:

element ::= EmptyElemTag | STag content ETag

Both EmptyElemTag and STag reference Production 5:

Name ::= NameStartChar (NameChar)*

And it’s this Name which we’ll match against the DTD’s name. But, of course, Name is used in many other Productions, which I’ll omit listing here, so clearly we cannot modify the processing at Name to validate against the DTD. Or, more accurately, a hacker might find some way to get there, but that’s not really good enough.

More importantly, the element Production is also referenced from another location, namely Production 43 (content), and again it would be inappropriate to modify the element Production to validate a special case. So, how to handle this, and potentially other special cases, while properly highlighting the purpose of the modification to the EBNF?

I hit upon using the previously described debug mechanism.  First, I defined the notion of a Production Post Processor:

Post_Production_Processor = Handlers -> List(XML_OnePass) -> Dtd -> Int -> Int -> (Handlers, Dtd);

In English, a production post processor accepts all the SAX handlers provided by the user, the current state of processing, the DTD, and (for convenience) the current line and column numbers; it returns potentially new versions of the handlers and the DTD.  (Yes, I’m wondering if returning a new DTD is pointless.)
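
For readers more comfortable with Python, the signature can be rendered roughly as follows. This is a sketch, not the parser's code: Handlers, Dtd, and State are hypothetical stand-ins (the post doesn't show the real types), and no_op is a made-up example of the simplest function that fits the shape.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Handlers:
    """Hypothetical stand-in for the user's SAX handlers."""
    errors: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Dtd:
    """Hypothetical stand-in for the DTD; only the declared name."""
    name: Optional[str] = None


# Assumption: the processing state is modelled as a list of strings.
# The original uses List(XML_OnePass), whose contents are not shown.
State = List[str]

# handlers -> state -> dtd -> line -> column -> (handlers, dtd)
PostProductionProcessor = Callable[
    [Handlers, State, Dtd, int, int], Tuple[Handlers, Dtd]
]


def no_op(handlers: Handlers, state: State, dtd: Dtd,
          line: int, column: int) -> Tuple[Handlers, Dtd]:
    """Simplest possible processor: return everything unchanged."""
    return handlers, dtd
```

Returning a fresh (handlers, dtd) pair rather than mutating the arguments mirrors the functional style of the signature above.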

I added a map from integer (the Production’s number, which serves as its name) to Post_Production_Processor to the internal state of the SAX parser.  This map is updated by a new function, validity_constraint, which acts as a parser: it accepts a Production name and a Post_Production_Processor, records the pair in the map, and then executes the success function.

As you might have guessed, then, I’ve modified the p_out function to look for a match between its name argument and the entries in the map from name to Post_Production_Processor.  If one is found, the entry is deleted and the post processor is executed.  The success function is then invoked with the results of the post processor rather than the values passed in.
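
The register-then-fire behaviour can be sketched in Python as below. This is a simplified model under stated assumptions, not the actual implementation: ParserState is a hypothetical class holding only the map, processors take just (handlers, dtd) with the state, line, and column arguments dropped for brevity, and the real validity_constraint is a parser that goes on to call the success function, which this version leaves out.

```python
from typing import Any, Callable, Dict, Tuple

# Simplified processor shape: (handlers, dtd) -> (handlers, dtd).
Processor = Callable[[Any, Any], Tuple[Any, Any]]


class ParserState:
    """Hypothetical internal state: just the pending-processor map."""

    def __init__(self) -> None:
        self.pending: Dict[int, Processor] = {}


def validity_constraint(state: ParserState, production: int,
                        processor: Processor) -> None:
    """Register a processor to run when `production` next finishes."""
    state.pending[production] = processor


def p_out(state: ParserState, production: int, handlers: Any, dtd: Any,
          success: Callable[[Any, Any], Any]) -> Any:
    """Leave a production, running and removing any pending processor."""
    processor = state.pending.pop(production, None)
    if processor is not None:
        # The processor's results replace the values passed in.
        handlers, dtd = processor(handlers, dtd)
    return success(handlers, dtd)
```

Because the entry is popped from the map, a registered processor fires exactly once: a second p_out for the same Production number passes the handlers and DTD through untouched.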

Usage?

vc = validity_constraint;    # purely for readability

xml_doc = p_in 1 & start_doc & prolog & vc 5 validate_dtd_name & element & <misc> & one_p & p_out 1;

Because vc sits between prolog and element, validate_dtd_name is invoked the next time Production 5 terminates, and I’ve determined that the next Production 5 will ALWAYS be the outermost element’s name.
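
What a processor like validate_dtd_name might do can be sketched in Python as follows. The body of the real function isn't shown here, so everything in this sketch is an assumption: the argument list is flattened to plain values, a mismatch is reported by appending a message to a list of errors, and a missing DTD name means the check is skipped.

```python
from typing import List, Optional, Tuple


def validate_dtd_name(errors: List[str], last_name: str,
                      dtd_name: Optional[str], line: int,
                      column: int) -> Tuple[List[str], Optional[str]]:
    """Compare the Name just parsed with the DOCTYPE name.

    Hypothetical behaviour: with no DTD name the check is skipped;
    on a mismatch a new error list is returned with one more entry.
    """
    if dtd_name is None or last_name == dtd_name:
        return errors, dtd_name
    message = (f"{line}:{column}: root element '{last_name}' does not "
               f"match DOCTYPE name '{dtd_name}'")
    return errors + [message], dtd_name
```

The line and column numbers earn their place in the signature here, since they let the error message point at the offending element.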

For comparison, here’s the original EBNF of Production 1:

document ::= prolog element Misc*

The added implementation elements are p_in, start_doc, vc, one_p, and p_out.  The transformation is rather large, but still straightforward:

p_in is the debug (and now validity constraint) mechanism;

start_doc implements the content handler’s start_document functionality;

vc, as discussed;

one_p, the processor specific to Production 1 – not all Productions have processors, but this one does (it returns the handlers to the caller);

p_out, as discussed.

Most productions only have p_in and p_out additions.  A few have processors.  So far, only Production 1 has a post processor, for which I should probably come up with a better name: validity processor, perhaps.  Let me know if you have a better name.

I should change xml_doc to document, just for consistency.  (Consistency is rarely of interest to me, sadly.)

Testing of this mechanism has yielded positive results, in that I can see the processor invoked.  I have to modify one of my tests to have a DTD in order to really test it, and I haven’t gotten that far; if a DTD is not defined, the requirement is ignored.

I look forward to trying to implement other validity constraints with this mechanism.


About Hue White

Former BBS operator; software engineer; cat lackey.
