[mercury-users] Leveraging Perl

Richard A. O'Keefe ok at atlas.otago.ac.nz
Wed May 2 13:51:44 AEST 2001


Tomas By <T.By at dcs.shef.ac.uk> wrote:
	But we were discussing parsing a HTML file to look for a name, and
	that is the kind of thing I'm using my parser for. I'm vaguely aware
	there are syntactic differences between HTML and XML in the
	specifications, but given the degree to which actual pages on the web,
	and actual browsers people use, follow the specs, all that seems
	rather irrelevant...
	
Not in the least.  If you use HTMLTidy as a front end to clean up the
muck that typical commercial HTML editors generate, you get workable
clean HTML out the other end.

"The degree to which actual pages follow the specs" is such as to derail
any reasonable SGML parser, true.  But XML parsers have to be stricter
even than that.  Joe Armstrong, writing an HTML parser in Erlang, found
that he needed 16 separate stacks to cope with commercial HTML editor output.

There are numerous incompatibilities between HTML and XML.  For example,
<PRE>
foo
</PRE>
must not have any newline characters in its representation, for HTML,
but for XML it must have two of them.  For another,
<!-- this is a comment --
  -- that continues on several --
  -- lines -->
is legal HTML but illegal XML.  In fact <!> is a legal HTML comment, which
I have often seen, but not a legal XML comment.  Then of course there are
things like
	<UL COMPACT>...</UL>
which are not legal in XML and cannot be handled without looking fairly
closely at the DTD.  In fact this particular example DOES derail the xml
parser in question.
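
A minimal sketch of two of these incompatibilities, using Python's standard
library rather than the parser under discussion.  The helper names
(is_well_formed_xml, html_start_tags) and the sample strings are made up for
illustration.  The expat-based XML parser should reject both samples, while
the lenient HTML tokenizer accepts the minimized attribute.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Illustrative samples: each is acceptable HTML (SGML) but not well-formed XML.
SAMPLES = {
    "minimized attribute": "<UL COMPACT><LI>x</LI></UL>",
    "multi-part comment": "<P><!-- one -- -- two -->x</P>",
}


def is_well_formed_xml(text: str) -> bool:
    """Return True if the expat-based XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class StartTags(HTMLParser):
    """Collect (tag, attributes) pairs for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def html_start_tags(text: str):
    """Tokenize the text as HTML and return the start tags found."""
    parser = StartTags()
    parser.feed(text)
    parser.close()
    return parser.seen


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(label, "-> well-formed XML?", is_well_formed_xml(text))
    print(html_start_tags(SAMPLES["minimized attribute"]))
```

The HTML tokenizer reports COMPACT as an attribute with no value; deciding
what it actually means still requires the DTD.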

	For the record, my parser does not merely ignore the DTD - it can also
	allow unbalanced tags. This, I feel, is a *very* useful feature when
	reading real web pages.
	
There were two Zipfiles, xml.zip and xml-0.5.zip.
The xml.m in xml.zip is dated 1998.

parse_doctype(Result) -->
  ( token_string("OCTYPE") ->
    ( skip_to(">",S) ->
      { Result = ok(spec(doctype(S))) }
    ; { Result = error(dangling("doctype block")) } )
  ; { Result = error(expected("'OCTYPE'")) } ).

The xml.m in xml-0.5.zip is also dated 1998 and has the same code.

What is actually _done_ with the result?  It is included in a spec(_)
element, is all.  I call that "ignoring the DTD".

Do I need to mention that the definition here is *wrong*?  That it will
misbehave very badly on perfectly legal XML documents?
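
To illustrate the failure: a doctype declaration with an internal subset
contains ">" characters before its real end, so skipping to the first ">"
stops too early.  The sketch below contrasts that with a scan that tracks
brackets, quoted literals, and comments.  Both function names are
hypothetical, and the scan is deliberately incomplete (it ignores processing
instructions and parameter entities, for instance).

```python
def naive_doctype_end(text: str) -> int:
    """Index of the first '>', which is what a skip-to-'>' rule finds."""
    return text.index(">")


def doctype_end(text: str) -> int:
    """Index of the '>' that really closes the doctype declaration.

    Skips comments and quoted literals, and ignores any '>' that appears
    inside the [...] internal subset.
    """
    i, depth = 0, 0
    while i < len(text):
        if text.startswith("<!--", i):
            end = text.find("-->", i + 4)
            if end < 0:
                raise ValueError("unterminated comment in doctype")
            i = end + 3
            continue
        ch = text[i]
        if ch in "\"'":
            end = text.find(ch, i + 1)
            if end < 0:
                raise ValueError("unterminated literal in doctype")
            i = end + 1
            continue
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == ">" and depth == 0:
            return i
        i += 1
    raise ValueError("dangling doctype block")


DOC = (
    "<!DOCTYPE foo [\n"
    "  <!ELEMENT foo EMPTY>\n"
    '  <!ATTLIST foo ugh CDATA "fred">\n'
    "]>\n"
    "<foo/>"
)

if __name__ == "__main__":
    print(repr(DOC[: naive_doctype_end(DOC) + 1]))  # cut off inside the subset
    print(repr(DOC[doctype_end(DOC) + 1 :]))        # what follows the doctype
```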

One consequence of ignoring the doctype is that this program systematically
gets HTML (even XHTML) wrong, because it doesn't know what any of the
(X)HTML character names (general entities) are.  The elements
<i>x&sup2;</i>
<i>x&#178;</i>
<i>x²</i>
are supposed to have _identical_ content.
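
A small sketch of the point, again with Python's standard library rather
than the parser under discussion.  html.unescape knows the HTML entity
names, so the named, numeric, and literal forms all decode to the same
text.  An XML parser that has not seen the entity declarations cannot
resolve &sup2; at all.  The helper xml_text is hypothetical.

```python
import html
import xml.etree.ElementTree as ET
from typing import Optional

# Three spellings of "x squared": named entity, numeric reference, literal.
FORMS = ["x&sup2;", "x&#178;", "x\u00b2"]


def xml_text(fragment: str) -> Optional[str]:
    """Text of <i>fragment</i> per the XML parser, or None if it is rejected."""
    try:
        return ET.fromstring(f"<i>{fragment}</i>").text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    for form in FORMS:
        print(form, "| HTML:", html.unescape(form), "| XML:", xml_text(form))
```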

Supporting tag omission is nice, but this parser only does half the job.
It doesn't handle elements where _both_ tags are omitted.
For example,
    <title>This is legal HTML</title>
    <p>Here is its only paragraph.
is *really*
    <html>
     <head><title>This is legal HTML</title></head>
     <body><p>Here is its only paragraph.</p></body>
    </html>
and of the 7 omitted tags, this xml parser will only recover 1.  To know
about the others, you have to process the DTD.

Supporting tag omission is nicer when it works.
Consider the following *validated* HTML document.
    <title>Goo!</title>
    <ul><li><p>Gooey<p>goo<li><p>for a goo-goose</ul>
equivalent to
    <html>
      <head><title>Goo!</title></head>
      <body>
       <ul>
        <li><p>Gooey</p><p>goo</p></li>
        <li><p>for a goo-goose</p></li>
       </ul>
      </body>
    </html>

Given that this parser doesn't know what's in the DTD, we _don't_ expect
it to figure out the <html>, <head>, and <body> tags, but we _do_ expect
it to insert the omitted </p> and </li> tags.
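
A minimal sketch of that kind of end-tag recovery, assuming a small
hand-written table in place of the DTD.  The table IMPLICITLY_CLOSES and the
names EndTagInserter and insert_end_tags are hypothetical; this is not how
the parser under discussion works.  It covers only <p> and <li>, and drops
attributes to stay short.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical subset of the omission rules implied by the HTML DTD:
# a start tag (key) implicitly ends these elements when they are innermost.
IMPLICITLY_CLOSES = {"p": {"p"}, "li": {"p", "li"}}


class EndTagInserter(HTMLParser):
    """Re-emit the input with omitted </p> and </li> end tags made explicit."""

    def __init__(self):
        super().__init__()
        self.open = []  # stack of currently open element names
        self.out = []   # output fragments

    def _close_innermost(self):
        self.out.append(f"</{self.open.pop()}>")

    def handle_starttag(self, tag, attrs):
        closes = IMPLICITLY_CLOSES.get(tag, set())
        while self.open and self.open[-1] in closes:
            self._close_innermost()
        self.open.append(tag)
        self.out.append(f"<{tag}>")  # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag in self.open:  # stray end tags are ignored
            while self.open[-1] != tag:
                self._close_innermost()
            self._close_innermost()

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def insert_end_tags(text: str) -> str:
    parser = EndTagInserter()
    parser.feed(text)
    parser.close()
    while parser.open:  # close whatever is still open at end of input
        parser._close_innermost()
    return "".join(parser.out)


if __name__ == "__main__":
    print(insert_end_tags("<ul><li><p>Gooey<p>goo<li><p>for a goo-goose</ul>"))
```

Recovering the omitted <html>, <head>, and <body> start tags is a different
problem: that needs the content models from the DTD, which a table like this
does not capture.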

It doesn't.  I patched test.m to call html_stream_to_tree/3 instead of
xml_stream_to_tree/3, and this is the output I got:

    for a goo-goose<p/>

Presumably that's why it's an 0.5 version.  To be honest, I _was_ expecting
the bracketing to be wrong, that's why I constructed that particular example.
I _wasn't_ expecting most of the input to quietly disappear.

The other aspect where non-validating parsers fall down very badly is in
attribute defaults.  Given
    <!DOCTYPE foo [
      <!ELEMENT foo EMPTY>
      <!-- parser stops  ^ there, which is wrong-->
      <!ATTLIST foo bar CDATA #IMPLIED ugh CDATA "fred">
    ]>
    <foo/>
the element _should_ have attribute ugh="fred", but it won't.
And yes, there are elements in HTML that have default values for attributes,
and yes, some of them are attributes whose values I have wanted.
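
For comparison, a sketch of what a parser that does read the internal subset
reports.  This assumes the expat-based parser in Python's standard library,
which is non-validating but does process ATTLIST declarations it has read.
The document is the example above without the comment line.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE foo [
  <!ELEMENT foo EMPTY>
  <!ATTLIST foo bar CDATA #IMPLIED ugh CDATA "fred">
]>
<foo/>"""

root = ET.fromstring(DOC)

if __name__ == "__main__":
    # 'ugh' is supplied from its declared default; 'bar' is #IMPLIED,
    # so it stays absent.
    print(root.tag, root.attrib)
```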

	> But XML is "how to transfer structured data from one program to another
	> when you have never bothered to find out about Lisp"; it is unwritable
	> by humans, and except in tiny examples, largely unreadable by humans.
	> Its parent meta-notation, SGML, *is* human-writable and human-readable,
	> as well as noticeably more compact.
	
	You might exaggerate a bit here. I have written non-trivial amounts of
	documentation in Docbook, using only XML syntax, and I find that
	doable. It's not exactly pretty but neither is SGML.
	
I exaggerate very little.  Human processability was very explicitly
a design goal of SGML, and very explicitly NOT a design goal of XML.
I'm having to write some DocBook stuff myself, wishing desperately for
good documentation about it (and yes, I _have_ the O'Reilly DocBook book,
and no, in my view it is _not_ adequate documentation).

	> With reference to ftp://ftp.dcs.shef.ac.uk/home/tomas/
	> my Postscript viewer reports a Postscript error [...]
	
	Sigh. (Thanks for the info)
	
There is also a bug in test.m in the xml-0.5.zip, or at least Mercury
wouldn't compile it until I made a change.

    xml__write(NewStream,IO1,IO)
should be
    io__stdout_stream(S,IO1,IO2),
    xml__write(S,NewStream,IO2,IO)


There's an XML parser in the mercury-extras kit.

I *still* think that an SGML parser is a useful thing, and that since there
is a small fast free one that was _designed_ to be used with a Prolog system,
porting it to Mercury would be a Good Thing.
