{"id":1868,"date":"2015-08-27T17:24:09","date_gmt":"2015-08-27T22:24:09","guid":{"rendered":"http:\/\/huewhite.com\/umb\/?p=1868"},"modified":"2015-09-01T18:10:45","modified_gmt":"2015-09-01T23:10:45","slug":"current-project-ctd-7","status":"publish","type":"post","link":"https:\/\/huewhite.com\/umb\/2015\/08\/27\/current-project-ctd-7\/","title":{"rendered":"Current Project, Ctd"},"content":{"rendered":"<p>As I have been working on this <a href=\"https:\/\/huewhite.com\/umb\/2015\/07\/04\/current-project-ctd-6\/\" target=\"_blank\">project<\/a>, I noticed some duplication of the BNFs, but with different requirements, and this bothered me.\u00a0 I felt the duplication was required by production 80,<\/p>\n<blockquote><p>EncodingDecl\u00a0\u00a0\u00a0::=\u00a0\u00a0\u00a0 S &#8216;encoding&#8217; Eq (&#8216;&#8221;&#8217; EncName &#8216;&#8221;&#8217; | &#8220;&#8217;&#8221; EncName &#8220;&#8217;&#8221; )<\/p><\/blockquote>\n<p>The key in this production is that <em>EncName<\/em> names the code set for the input, implying that you may have to change your transcoding mechanism.\u00a0 It seemed the most straightforward approach was to use the parser to pull off the encoding name, so I duplicated the relevant BNFs and used a separately scoped recursive-descent parser to properly recover the encoding name.<\/p>\n<p>Then I ran into production 28:<\/p>\n<blockquote><p>doctypedecl\u00a0\u00a0\u00a0::=\u00a0\u00a0\u00a0&#8216;&lt;!DOCTYPE&#8217; S Name (S ExternalID)? S? (&#8216;[&#8217; intSubset &#8216;]&#8217; S?)? 
&#8216;&gt;&#8217;<\/p><\/blockquote>\n<p>which looks rather harmless, until you read <a href=\"http:\/\/www.w3.org\/TR\/REC-xml\/#NT-doctypedecl\" target=\"_blank\">this<\/a>:<\/p>\n<blockquote><p>Well-formedness constraint: External Subset<\/p>\n<p>The external subset, if any, MUST match the production for extSubset.<\/p><\/blockquote>\n<p>I looked at that, and then traced out <em>extSubset<\/em> a little bit.\u00a0 I could see it would require a different return value than the general case.\u00a0 Here&#8217;s <em>extSubset<\/em>:<\/p>\n<blockquote><p>extSubset\u00a0\u00a0\u00a0::=\u00a0\u00a0\u00a0 TextDecl? extSubsetDecl<\/p><\/blockquote>\n<p>and if you trace it out, you see a lot of use of other, common, productions.\u00a0 Common implies I&#8217;d have to duplicate them for the new functionality.\u00a0 That irritated me &#8211; inelegant AND (to quote my colleague Chris Johnson) &#8220;copies aren&#8217;t&#8221;.<\/p>\n<p>So I have worked out a refactoring.\u00a0 The single package, <em>sax<\/em>, is now two: <em>sax<\/em> and <em>sax_foundation<\/em>. 
<em>sax_foundation<\/em> will be a generic package, which is to say it is parameterized: along with the handlers [noted below], it also accepts a specification of a return value.\u00a0 <em>sax<\/em> now manages the parsing effort.\u00a0 <em>sax_foundation<\/em> contains the entire BNF and provides parsing entry points for the encoding, the general case (which I think could be conflated with the encoding case), and (anticipated) <em>extSubset<\/em>.\u00a0 When <em>sax<\/em> needs an encoding from the document, it uses an instantiation of <em>sax_foundation<\/em> which returns a String containing the name; when it needs to execute <em>extSubset<\/em>, it&#8217;ll use an instantiation of <em>sax_foundation<\/em> that returns the Dtd structure; and in the general case, the return value will be a structure containing the handlers (content, lexical, error, and entity &#8211; so far), which is the structure <em>sax<\/em> itself will return to callers.<\/p>\n<p>I am currently running regression tests to make sure the new structure works at least as well as the old structure.\u00a0 One batch of problems resulted from splitting <em>sax<\/em> in two &#8211; the exceptions moved from <em>sax<\/em> into the inner, hidden <em>sax_foundation<\/em>, which is not appropriate, so I split the exceptions into an independent package that everyone can access.\u00a0 The other batch of problems has come from inadvertently losing behavior that was implemented by the now-eliminated duplicate BNFs for <em>EncodingDecl<\/em>; so far there have been only a couple of these.\u00a0 Once these are cleaned up, regression should be clean.\u00a0 Then I can implement the link to <em>extSubset<\/em>.\u00a0 This will be done using a callback function.\u00a0 <em>sax<\/em> will supply the callback function, and when it&#8217;s called (by production 28) it will invoke a new <em>sax_foundation<\/em> customized for returning a new Dtd (since that&#8217;s what <em>extSubset<\/em> is all about, as I understand it) and 
process the <em>extSubset<\/em> of the new instantiation using a yet to be defined entry point.<\/p>\n<p>The callback function is, essentially, a hack, but not a bad one, since it removed a great deal of duplicated code.<br \/>\nOn my TODO list: find (or, reluctantly, construct) a set of XML pages which may be considered to be an official test suite for XML parsers, including expected responses for a SAX-style parser.\u00a0 Share the knowledge if you happen to know of such a suite!<\/p>\n<p>Also, finding a few huge XML pages will be necessary as I&#8217;m interested to see if an internal hack I constructed for scalability has also had a positive effect on performance.\u00a0 Unfortunately, I don&#8217;t have a way to construct a baseline, so this won&#8217;t be a professional-grade assessment, but just a &#8220;gee-whiz, that went fast&#8221;, or not.\u00a0 This may also require waiting for the release of a new version of Mythryl, as the current version seems to have a problem with handling file I\/O in the manner I need.\u00a0 (Or perhaps I just haven&#8217;t figured it out.)<\/p>\n<p>Sadly, there are concerns for the future of Mythryl.\u00a0 Earlier this year, the primary (and only) compiler developer, Cynbe ru Taren, was afflicted with colorectal cancer.\u00a0 Today I received news that while the removal of the cancer went well, he now has lung cancer.\u00a0 He continues to work on the project, but I do not believe he has any fellow developers, so if he goes down, the Mythryl project will probably die with him, which would be a shame.\u00a0 It has some great potential, and brings into sharp relief the excessive dangers of C &amp; C++; the possibility of using formal methods for verifying code, impossible (or extremely difficult) in other languages, is quite alluring.\u00a0 If you think you&#8217;re a compiler hacker with some sharp skills, you might want to consider getting involved.\u00a0 Cynbe has done the majority of the lifting (which is to say, translating 
an academic project into a production-grade compiler), but I&#8217;m sure it could still use a lot of work.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As I have been working on this project, I noticed some duplication of the BNFs, but with different requirements, and this bothered me.\u00a0 I felt this was required by production 80, EncodingDecl\u00a0\u00a0\u00a0::=\u00a0\u00a0\u00a0 S &#8216;encoding&#8217; Eq (&#8216;&#8221;&#8216; EncName &#8216;&#8221;&#8216; | &#8220;&#8216;&#8221; EncName &#8220;&#8216;&#8221; ) The key in this production is that \u2026 <a class=\"continue-reading-link\" href=\"https:\/\/huewhite.com\/umb\/2015\/08\/27\/current-project-ctd-7\/\"> Continue reading <span class=\"meta-nav\">&rarr; <\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1868","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/posts\/1868","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/comments?post=1868"}],"version-history":[{"count":3,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/posts\/1868\/revisions"}],"predecessor-version":[{"id":1936,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/posts\/1868\/revisions\/1936"}],"wp:attachment":[{"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/media?parent=1868"}]
,"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/categories?post=1868"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/tags?post=1868"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}