This is the latest installment in the saga of my Mythryl coding project. The last entry, on replacement text in DTDs, suggested I would take this approach:
The extSubset production will be handled via a callback function that can then build a new recursive descent parser that knows about parameter entity references and generate a new string with all the replacements accomplished. I have not made a decision about intSubset, or perhaps markupdecl – a solution is not readily apparent.
I ended up abandoning that approach. My first approaches tend to be a little Byzantine compared to the final one, so that’s not surprising. However, the final approach (a bit of a grim phrase for a programming project, really) is not much more appealing: I decided to take the straightforward, but clumsy, route of creating a new production rule:
pe_reference_replacement' = p_in 1000 & |ws_i| & is_this "%" & name & is_this ";" & implement_replacement & p_out 1000;
pe_reference_replacement = <(irrelevant_ws & pe_reference_replacement')> ;
This is in Mythryl using the recursive descent parser’s operators. In the second production, the <> means 0 or more matching productions, while irrelevant_ws simply means there may be discardable whitespace. Basically, there may be multiple appearances of pe_reference_replacement’, separated by optional whitespace.
The first production wraps ws_i in the | | operator, which marks its contents as optional. Since ws_i is a synonym for irrelevant_ws, which the second production already matches, this part is probably redundant (and if you’re thinking, on these two items, My, he’s sloppy – well, you win a prize!). More interesting, there’s a check for a match to ‘%’, then a name, and then the required semicolon – this is the syntax of a parameter entity reference. If all of these are matched, then implement_replacement is called to actually carry out the replacement.
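For readers who don't speak Mythryl, here is a rough Python sketch of the same shape: sequence, optional, and zero-or-more combinators wired into a matcher for `%name;`. Everything in it is an assumption made for illustration (the combinator names `seq`, `optional`, and `many`, the position-returning convention, and the simplified ASCII-only name pattern); it is not the Mythryl library's API, and it leaves out implement_replacement and the p_in/p_out bookkeeping.

```python
import re

# Each parser takes (text, pos) and returns the new position, or None on failure.

def is_this(literal):
    def parse(text, pos):
        return pos + len(literal) if text.startswith(literal, pos) else None
    return parse

def regex(pattern):
    compiled = re.compile(pattern)
    def parse(text, pos):
        m = compiled.match(text, pos)
        return m.end() if m else None
    return parse

def seq(*parsers):
    # Plays the role of '&': every part must match, one after another.
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def optional(parser):
    # Plays the role of |...|: a failed match consumes nothing.
    def parse(text, pos):
        end = parser(text, pos)
        return pos if end is None else end
    return parse

def many(parser):
    # Plays the role of <...>: zero or more matches.
    def parse(text, pos):
        while True:
            end = parser(text, pos)
            if end is None or end == pos:  # stop on failure or on no progress
                return pos
            pos = end
    return parse

irrelevant_ws = regex(r"[ \t\r\n]*")
name = regex(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")  # simplified, ASCII-only Name

pe_reference = seq(optional(irrelevant_ws), is_this("%"), name, is_this(";"))
pe_references = many(seq(irrelevant_ws, pe_reference))

# Position reached after consuming both references and the whitespace between them.
print(pe_references("  %xx; %zz;<!ELEMENT", 0))
```

The sketch also makes the redundancy visible: `irrelevant_ws` has already eaten the whitespace by the time `optional(irrelevant_ws)` runs, so the inner match never consumes anything.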
Because implement_replacement is highly dependent on my implementation back end, I’ll just say it puts the replacement text on the input and continues onward for more processing.
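To give a flavor of "putting the replacement text on the input", here is a minimal Python sketch. The `Input` class, the `entities` dictionary, and this function's signature are all hypothetical stand-ins, not my actual back end. The one detail taken from the XML spec is that a parameter entity's replacement text gets one space added on each side when it is included in the DTD.

```python
import re

PE_REF = re.compile(r"%([A-Za-z_:][A-Za-z0-9_:.\-]*);")  # simplified Name

class Input:
    """Hypothetical input buffer: the text plus the current read position."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

def implement_replacement(inp, entities):
    """If a PE reference starts at inp.pos, splice its replacement text in
    over the reference so the parser reads the replacement next."""
    m = PE_REF.match(inp.text, inp.pos)
    if m is None:
        return False
    # An undeclared entity raises KeyError here; a real parser would report it.
    replacement = " " + entities[m.group(1)] + " "
    inp.text = inp.text[:m.start()] + replacement + inp.text[m.end():]
    return True  # inp.pos is unchanged, so parsing resumes at the replacement

inp = Input("%xx; ]>")
implement_replacement(inp, {"xx": "<!ELEMENT a EMPTY>"})
print(repr(inp.text))
```

Because the position does not move, whatever production runs next sees the replacement text as if it had been in the document all along, which is the whole point.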
References to pe_reference_replacement are now sprinkled throughout the relevant productions. An example, first from the spec, then the original Mythryl code, then the updated:
EntityDef ::= EntityValue | (ExternalID NDataDecl?)
entity_def = p_in 73 & (entity_value | (external_id & |ndata_decl| )) & seventy_three_p & p_out 73;
entity_def = p_in 73 & |(pe_reference_replacement)| & (entity_value | (external_id & |ndata_decl| )) & seventy_three_p & p_out 73;
Subsidiary productions, where appropriate, also contain sprinklings of pe_reference_replacement as optional accompaniments. Is this a satisfactory approach? Barely. It obscures the production’s purpose, and it’s prone to errors. I do not look forward to searching out bugs. But I was not able to find an approach with a better balance in the limited time I allot to this project.
My first test of this comes from the W3C XML specification itself, specifically Appendix D, entitled “Expansion of Entity and Character References (Non-Normative)”. The example is labeled as particularly difficult, and is as follows:
<?xml version='1.0'?>
<!DOCTYPE test [
<!ELEMENT test (#PCDATA) >
<!ENTITY % xx '&#37;zz;'>
<!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
%xx;
]>
<test>This sample shows a &tricky; method.</test>
I can now parse this properly and report the expected results (all of the expected entity values, as well as the proper PCDATA for element test). Yes, yes, this is hardly complete testing, but I’ve done enough damage today.
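To see why the example is tricky, here is a small Python sketch that walks through the expansion. It is an illustration under heavy assumptions, not my Mythryl parser: it uses regular expressions instead of real productions, only handles entity declarations with literal values, skips every other declaration, and does no error or recursion checking. It uses the spec's own spelling of the declarations, which writes the percent and less-than signs as the character references `&#37;` and `&#60;`. The function names are made up for the sketch.

```python
import re

NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"  # simplified, ASCII-only Name
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")
PE_REF = re.compile(r"\s*%(" + NAME + r");")
GENERAL_REF = re.compile(r"&(" + NAME + r");")
ENTITY_DECL = re.compile(
    r"\s*<!ENTITY\s+(%\s+)?(" + NAME + r")\s+(?:'([^']*)'|\"([^\"]*)\")\s*>"
)
OTHER_DECL = re.compile(r"\s*<![^>]*>")

def expand_char_refs(literal):
    # Character references are expanded when the entity declaration is read.
    def repl(m):
        return chr(int(m.group(1), 16) if m.group(1) else int(m.group(2)))
    return CHAR_REF.sub(repl, literal)

def process_internal_subset(subset):
    params, general = {}, {}
    text = subset
    while text.strip():
        m = ENTITY_DECL.match(text)
        if m:
            value = m.group(3) if m.group(3) is not None else m.group(4)
            table = params if m.group(1) else general
            table.setdefault(m.group(2), expand_char_refs(value))  # first declaration wins
            text = text[m.end():]
            continue
        m = PE_REF.match(text)
        if m:
            # Put the space-padded replacement text back on the input and reparse it.
            text = " " + params[m.group(1)] + " " + text[m.end():]
            continue
        m = OTHER_DECL.match(text)
        if m:
            text = text[m.end():]  # e.g. <!ELEMENT ...>, ignored in this sketch
            continue
        raise ValueError("cannot parse: " + text[:30])
    return params, general

def expand_general_refs(content, general):
    return GENERAL_REF.sub(lambda m: general[m.group(1)], content)

subset = """
<!ELEMENT test (#PCDATA) >
<!ENTITY % xx '&#37;zz;'>
<!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
%xx;
"""

params, general = process_internal_subset(subset)
content = expand_general_refs("This sample shows a &tricky; method.", general)
print(params)
print(general)
print(content)
```

The chain is: `%xx;` is replaced by `%zz;`, which is reparsed and replaced by the declaration of `tricky`, which is then parsed like any other declaration. Only after that does `&tricky;` in the element content have something to refer to.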