[mercury-users] Leveraging Perl

Richard A. O'Keefe ok at atlas.otago.ac.nz
Wed May 2 10:46:08 AEST 2001


I wrote:
	> By far the simplest way to [read HTML] would be to use SWI Prolog, with its
	> SGML parser that lets you slurp in HTML or most other SGML applications
	> (*not* just XML) as a Prolog data structure and then process it as a tree.
          ^^^^^^^^^^^^^^^^

Let me stress that.  The SWI Prolog XML parser is not the world's fastest
XML parser (although I have reason to believe that it could be about 25%
faster if a certain interface were turned inside-out).  Thing is, it's
an ***SGML*** parser, equally at home with RDF and DocBook.  As an SGML
parser, it is considerably faster than James Clark's nsgmls.

Tomas By <T.By at dcs.shef.ac.uk> wrote:
	My Mercury XML parser has a very similar interface, and is quite fast
	(since it does not use higher order combinators - lesson here)
	
	ftp://ftp.dcs.shef.ac.uk/home/tomas/xml-0.5.zip
	(see also ftp://ftp.dcs.shef.ac.uk/home/tomas/xml.ps)
	
	No need to port the SWI prolog parser IMHO - somebody just need to
	write documentation for mine. I'd do it if I had the time.
	
It's nice to have an XML parser in Mercury, but an XML parser cannot parse
even clean HTML, because HTML is not XML.  (XHTML is, of course.)  It isn't
terribly difficult to write an XML parser; I have one of my own that's about
1000 lines of C and noticeably faster than expat.  I seem to have collected
3 different XML parsers in Smalltalk, and am aware of 2 others.
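
To illustrate the point that clean HTML is not XML, here is a minimal sketch in
Python. It uses the standard library's xml.etree.ElementTree and html.parser, not
either of the parsers discussed in this thread, and the helper names are made up
for the example. Note that html.parser is only a tokenizer: it does not consult
a DTD or build a tree the way a real SGML parser does.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Valid HTML 4: the </li> end tags are optional there, but not in XML.
CLEAN_HTML = "<ul><li>one<li>two</ul>"


def xml_parse_ok(text):
    """Return True if an XML parser accepts the text as well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record the start tags seen by an HTML-aware tokenizer."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


print(xml_parse_ok(CLEAN_HTML))        # the XML parser rejects it
print(html_start_tags(CLEAN_HTML))     # the HTML tokenizer copes
```

The same list written as XHTML, with every </li> present, is accepted by the
XML parser.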

But XML is "how to transfer structured data from one program to another
when you have never bothered to find out about Lisp"; it is unwritable
by humans, and except in tiny examples, largely unreadable by humans.
Its parent meta-notation, SGML, *is* human-writable and human-readable,
as well as noticeably more compact.

An SGML parser is a nice thing to have around too.

With reference to ftp://ftp.dcs.shef.ac.uk/home/tomas/
my Postscript viewer reports a Postscript error in dcam.ps, making the
file unviewable.
    PageView V3.0
    Imaging: /export/users/okeefe_r/mercury.d/dcam.ps
    at 85 dpi on USLetter paper
    %% [Error: configurationerror Offending Command: setpagedevice ] %%

(I don't know why it picks USLetter paper.  We use A4 and have _never_ used
US paper sizes.)
The same thing happens with xml.ps:

    PageView V3.0
    Imaging: /export/users/okeefe_r/mercury.d/xml.ps
    at 85 dpi on USLetter paper
    %% [Error: configurationerror Offending Command: setpagedevice ] %%

When you have a Postscript file generated from a dvi file, it might be an
idea to provide the dvi as well.

Some important things to know about any XML parser:
 - how do you ask it NOT to report comments?
 - how do you ask it NOT to distinguish CDATA section content from other text?
 - how do you ask it NOT to distinguish &lt; from <?
 - how does it distinguish white-space text occurring in element context
   from other text, or is there some option to make it NOT report such text?
In short, how do you make its output actually agree with the XML 1.0 spec?
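
As a rough illustration of what those questions are getting at, the sketch
below uses Python's xml.etree.ElementTree (an assumption for the example, not
one of the parsers under discussion). With its default settings, different
spellings of the same character data come out identical and comments are not
reported, but white space between elements is still handed back as text.

```python
import xml.etree.ElementTree as ET


def text_of(xml_source):
    """Parse a one-element document and return its character data."""
    return ET.fromstring(xml_source).text


entity_form = text_of("<p>1 &lt; 2</p>")
charref_form = text_of("<p>1 &#60; 2</p>")
cdata_form = text_of("<p>1 <![CDATA[<]]> 2</p>")
commented = text_of("<p>1 &lt; 2<!-- not reported by default --></p>")

# The four spellings are indistinguishable in the parser's output.
print(entity_form == charref_form == cdata_form == commented)

# White space in element content is still reported; without a DTD the
# parser has no way to know that <list> may not contain text.
root = ET.fromstring("<list>\n  <item>a</item>\n  <item>b</item>\n</list>")
print(repr(root.text), repr(root[0].tail))
```

A caller that wants such white space suppressed has to filter it out itself,
for example by dropping text and tail strings that are empty after strip().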

There is one *major* difference between Jan Wielemaker's SWI SGML/XML parser
and Tomas By's XML parser, and that is that the latter completely ignores
the DTD.  It saves the <!DOCTYPE ... [...]> part as a string, but does
nothing with it.  That means, for example, that almost every XML document
I have ever written or generated would be misinterpreted...
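
To show why ignoring the DTD matters, here is a small sketch in Python with
xml.etree.ElementTree, which (via expat) does read the internal DTD subset.
The document, the element and attribute names, and the entity are all
invented for the example. The content of the memo element and its status
attribute both depend on declarations inside <!DOCTYPE ... [...]>.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the attribute default and the entity are only
# defined in the internal DTD subset.
DOC = """<!DOCTYPE memo [
  <!ATTLIST memo status CDATA "draft">
  <!ENTITY list "mercury-users">
]>
<memo>To: &list;</memo>"""

memo = ET.fromstring(DOC)

# Both values come from the DTD, not from the document instance itself.
print(memo.get("status"))
print(memo.text)
```

A parser that keeps the DOCTYPE only as an uninterpreted string has no way to
supply the defaulted attribute or to expand &list;, so it cannot report the
same document.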

I still think it would be a good idea to port the SWI SGML parser to Mercury.