[m-rev.] For review: xml pretty printer

Ralph Becket rafe at cs.mu.OZ.AU
Thu Jun 13 14:13:28 AEST 2002
Previous message: [m-rev.] For review: xml pretty printer
Next message: [m-rev.] trivial diff: fix asm_fast.gc.par compilation error
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Douglas Michael Auclair, Sunday,  9 June 2002:
> Dear all,
> 
> I've attached a pretty-printer for the XML parser written by Thomas Conway. 
> My contribution is primarily xml.pprint.m.  I modified xml.m to integrate 
> the pretty printer into the XML library, and tryit.m, 
> samples/newsarticles.xml and samples/newsarticles.dtd to demonstrate how to 
> use it and how it works (I've included comments in the source code, as 
> well).  I've used the LGPL, as documented in the sources.

Okay, there are a number of problems with your approach.  I'll include
the source here with comments:

Index: xml.pprint.m
+% xml.pprint.m
+% converts the object tree into formatted text output
+% author: Douglas M. Auclair
+% This file may only be copied under the terms of the GNU Library General
+% Public License - see the file COPYING.LIB in the Mercury distribution.
+
+:- module xml:pprint.
+:- interface.
+:- import_module xml:doc, io, string, array, pprint, list.

(It's unfortunate that Tom chose to use `:' as the module qualifier,
rather than the more standard `__'.  This prompts me to push for the
change to `.' as *the* module qualifier soon.)

+% N.B. the dispatch function MUST call transform_xml_children (below),
+%      if one wishes to continue iteration (which is a good idea).
+% No matching must also be handled by the client!

I think this is the wrong design decision.  I strongly feel that client
code should perform any nsDocument -> nsDocument translations itself and
then use a generic XML pretty printer.

It's also unfortunate that the client has to call transform_xml_children
rather than just passing one or two transformation predicates that are
handled by pprint_xml.

+:- pred pprint_xml(document,
+                   pred(string, pred(element, array(content), pprint__doc)),
+		   state, state).
+:- mode pprint_xml(in, pred(in, free -> pred(in, in, out) is det) is cc_multi,
+                   di, uo) is det.
+:- pred transform_xml_children(list(ref(content)), array(content),
+                   pred(string, pred(element, array(content), pprint__doc)),
+                   pprint__doc).
+:- mode transform_xml_children(in, in,
+                   pred(in, free -> pred(in, in, out) is det) is cc_multi,
+                   out) is det.

1. When you have `pred(in, in, out) is det' insts, the right things to use
are functions rather than predicates.  This way you can just use
ordinary `in' rather than complicated modes (this has been in the ROTDs
for some time).

2. Even if you have to use a pred inst, it's far better in terms of
readability and maintainability to use the parametric mode `out(...)'
rather than `free -> ...'.

3. The cc_multin determinisms are worrying (and wrong, as we'll see.)

+:- implementation.
+:- import_module std_util, int.
+
+pprint_xml(Doc, Pred) -->
+  { Doc1 = promise_only_solution(convert_to_doc(Doc, Pred)) },
+  pprint__write(80, Doc1).

Why not write the sub-expression in-line rather than first binding it to
a variable?

+transform_xml_children(List, Content, Pred,
+    nest(3, separated(dispatching(Content, Pred), line, List))).
+
+:- func dispatching(array(content),
+                    pred(string, pred(element, array(content), pprint__doc)),
+		    int)
+             = pprint__doc.
+:- mode dispatching(in, pred(in, free -> pred(in, in, out) is det) is cc_multi,
+                    in) = out is det.
+dispatching(Contents, Map, Elt) =
+  promise_only_solution(dispatch_on(lookup(Contents, Elt), Map, Contents)).

Unfortunately, that promise_only_solution is not legitimate.
promise_only_solution means just that, here it's being used more like a
a red cut in Prolog.  Which is bad.

+% pprints the document starting from the root.
+:- pred convert_to_doc(document,
+                    pred(string, pred(element, array(content), pprint__doc)),
+                    pprint__doc).
+:- mode convert_to_doc(in, pred(in, free -> pred(in, in, out) is det) is cc_multi,
+			out) is cc_multi.
+
+convert_to_doc(XML, Map, Doc) :-
+  dispatch_on(lookup(XML ^ content, XML ^ root), Map, XML ^ content, Doc);
+  Doc = text("Could not convert XML doc.").

It's better to layout disjunctions as

	(
		Case1
	;
		Case2
	;
		...
	)

or even better in this case, use clausal form:

convert_to_doc(XML, Map, Doc) :-
	dispatch_on(lookup(XML ^ content, XML ^ root), Map, XML ^ content, Doc).

convert_to_doc(_,   _,   text("Could not convert XML doc.")).

This is where the first explanation for your cc_multis becomes apparent
and is also a serious bug.

This is a disjunction and the compiler is quite free to reorder them as
it sees fit.  One of the things this predicate says is that text("Could
not convert XML doc.") is a legitimate solution for *any* bit of XML.
That is, the compiler could return this for *all* arguments!  This
clearly isn't what you mean.

+% takes action (a la transformation) on elements.
+:- pred dispatch_on(content,
+                    pred(string, pred(element, array(content), pprint__doc)),
+                    array(content), pprint__doc).
+:- mode dispatch_on(in, pred(in, free -> pred(in, in, out) is det) is cc_multi,
+                    in, out) is cc_multi.
+
+dispatch_on(data(String), _Map, _Content, word_wrapped(String)).
+dispatch_on(element(Element), Map, Content, Doc) :-
+  Name = Element ^ eName,
+  % Pred = promise_only_solution(Map(Name)),
+  Map(Name, Pred),
+  Pred(Element, Content, Doc). 

Why the two-step process?  Why not just Map(Name, Element, Content, Doc)?

+dispatch_on(_Elt, _Map, _Content, text("")).

Here we have exactly the same problem.

The thing is that since we know all the constructors for nsElement, we
can simply write a switch that handles all possible cases.  This would
make it a det function, which is what we want, and do away with the need
for all those dodgy calls to promise_only_solution.

The other problem with this is that it isn't really a document
formatter!  It's more an XML tree traversal algorithm.  Which suggests
that it might be made more general and put in a differently named
module.

> One caveat, the pretty printer has not been tested with namespaces (it 
> shouldn't take much to have it working properly with namespace, however); 
> otherwise, it works fine for me.  Please feel free to make or to recommend 
> any changes to include it with the XML library.


Index: tryit.m
[...]
+	    (
+	      	{ Res = ok((_DTD, Doc)) },
+	    	% { nsTranslate(Doc, NsDoc) }, 
+		% { New = cat:ok((DTD, NsDoc)) },
+                % write(New)
+	    	% if don't want to turn the doc to namespace awared, 
+		% change the above three lines to  
+	  	% write(Res) 
+		pprint_xml(Doc, xform)

The convention is to write a function that turns whatever (nsDocuments
in this case) into pprint__docs and then pass those to
pprint__write/[2,3].

[...]
+:- pred xform(string, pred(element, array(content), pprint__doc)).
+:- mode xform(in, free -> pred(in, in, out) is det) is cc_multi.
+
+% Here's a tranformer that prints the element names 
+
+% xform(_, (pred(Elt::in, Ary::in, Doc::out) is det
+%	    :- transform_xml_children(Elt ^ eContent, Ary, xform, Doc1),
+%               Doc = group(text(format("Element %s: ", [s(Elt ^ eName)]))
+% 	 	     `<>` Doc1))).
+
+
+
+% Here's a set of transformers that convert the newsarticle xml to HTML.
+% The "ARTICLE" xformer uses the xml attribute for outputting the date.
+% The "INTERNAL-USE" xformer hides that tag's text.
+
+xform("NEWSARTICLE",
+      (pred(Elt::in, Ary::in, Doc::out) is det :-
+         Intro = separated(text, line, 
+                           ["<html>", " <title>News Articles</title>",
+			    " <body bgcolor='white'>", "  <center>",
+			    "    <h2>News Articles</h2>", "  </center>"]),
+         transform_xml_children(Elt ^ eContent, Ary, xform, Doc1),
+         Close = separated(text, line, [" </body>", "</html>"]),
+         Doc = Intro `</>` Doc1 `</>` Close)).
+xform("ARTICLE",
+      (pred(Elt::in, Ary::in, Doc::out) is det :-
+         Intro = text(format("<p><em>Dated %s</em></p>",
+		             [s(index0_det(Elt ^ eAttrs, 0) ^ aValue)])),
+         transform_xml_children(Elt ^ eContent, Ary, xform, Doc1),
+         Doc = Intro `</>` Doc1)).
+xform("INTERNAL-USE",
+      (pred(Elt::in, Ary::in, Doc::out) is det :- Doc = text(""))).
+xform("TITLE",
+      (pred(Elt::in, Ary::in, Doc::out) is det :-
+         Intro = text("<h3>"),
+ 	 transform_xml_children(Elt ^ eContent, Ary, xform, Doc1),
+  	 Doc = Intro `<>` Doc1 `<>` text("</h3>") `<>` line)).
+xform("BODY",
+      (pred(Elt::in, Ary::in, Doc::out) is det :-
+         Intro = text("<p>"),
+ 	 transform_xml_children(Elt ^ eContent, Ary, xform, Doc1),
+  	 Doc = Intro `<>` Doc1 `<>` text("</p>") `<>` line)).
+xform(_, (pred(Elt::in, Ary::in, Doc::out) is det :-
+            transform_xml_children(Elt ^ eContent, Ary, xform, Doc))).

I think this would be far better handled by a preliminary nsDocument ->
nsDocument transformation (which would also be more general) rather than
an immediate nsDocument -> pprint__doc transformation.

The other thing that comes to mind is that this looks like the sort of
thing a small combinator library would do well, as an alternative
approach.

This has prompted me to knock out a generic nsDocument -> pprint__doc
formatter, however.  There are a number of things that need fixing, so
if anyone can offer help or criticism (at the moment I'm not
sufficiently in need of an XML formatter to spend half a day chasing up
the movable feast that is the XML standard... :) I'd be very grateful.

(By the way Tom, why are the ref values one greater than their content
array indices?  I didn't see any documentation on that one and I'm not
sure my fix of subtracting one isn't just a hack.)

(I've just thought of another problem with my solution: when should a tag body
appear verbatim rather than formatted?)

+%------------------------------------------------------------------------------%
+% xml.pprint.m
+% Ralph Becket <rafe at cs.mu.oz.au>
+% Wed Jun 12 16:45:23 EST 2002
+% vim: ft=mercury ff=unix ts=4 sw=4 et wm=0 tw=0
+%
+% Pretty printer formatting for nsDocument values.
+%
+% XXX The following need fixing:
+% XXX - names are not printed with namespace qualifiers;
+% XXX - space around tag bodies may be significant
+% XXX   (e.g. I don't know if `<foo>X</foo>' may be different to
+% XXX   `<foo> X </foo>'), if it is then the formatting rules will
+% XXX   need some fixing;
+% XXX - special characters in strings or text bodies are not escaped
+% XXX   (e.g. `"'s are not escaped);
+% XXX - at the moment the prestuff and poststuff fields are not
+% XXX   formatted - should they be?
+% XXX - no attempt at laziness has been made, which may be necessary
+% XXX   for formatting very complex XML documents.
+%
+%------------------------------------------------------------------------------%
+
+:- module xml__pprint.
+
+:- interface.
+
+:- import_module pprint.
+:- import_module xml__ns.
+
+
+
+:- instance doc(nsDocument).
+
+%------------------------------------------------------------------------------%
+%------------------------------------------------------------------------------%
+
+:- implementation.
+
+:- import_module list, int, string, array.
+:- import_module xml__doc.
+
+%------------------------------------------------------------------------------%
+
+:- instance doc(nsDocument) where [
+
+        % Just format the entire XML document (is there any reason to
+        % want a depth-limited version?)
+        %
+    doc(_, XMLDoc) =
+        xml_doc(XMLDoc ^ content, XMLDoc ^ content ^ elem(XMLDoc ^ root - 1))
+].
+
+
+:- type nsContents   == array(nsContent).
+
+:- type ref          == ref(nsContent).
+
+:- type refs         == list(ref).
+
+:- type nsAttributes == list(nsAttribute).
+
+%------------------------------------------------------------------------------%
+
+:- func indent(T) = doc <= doc(T).
+
+indent(XMLDoc) = nest(2, XMLDoc).
+
+%------------------------------------------------------------------------------%
+
+:- func xml_doc(nsContents, nsContent) = doc.
+
+xml_doc(_Contents, data(String)) =
+    group(word_wrapped(String)).
+
+xml_doc(_Contents, pi(Target, Data)) =
+    group("<?" ++ Target ++ space ++ indent(line ++ Data) ++ line ++ " ?>").
+
+xml_doc(_Contents, comment(String)) =
+    group("<!-- " ++ indent(line ++ word_wrapped(String)) ++ line ++ " -->").
+
+xml_doc(Contents, nsElement(Elt)) =
+    elt_doc(
+        Contents,
+        Elt ^ eName,
+        Elt ^ eAttrs,
+        Elt ^ eContent,
+        Elt ^ eNamespaces
+    ).
+
+%------------------------------------------------------------------------------%
+
+:- func elt_doc(nsContents, qName, nsAttributes, refs, nsList) = doc.
+
+elt_doc(Contents, QName, Attrs, Data, Namespaces) =
+    group(
+        line ++
+        group(
+            "<" ++ qname_doc(Namespaces, QName) ++
+            ( if Attrs = [] then nil else space ) ++
+            indent(
+                separated(attr_doc(Namespaces), space_line, Attrs)
+            ) ++
+            line ++
+            doc( if Data = [] then "/>" else ">" )
+        ) ++
+        indent(line ++ separated(datum_doc(Contents), nil, Data)) ++
+        line ++
+        ( if   Data = []
+          then nil
+          else "</" ++ qname_doc(Namespaces, QName) ++ ">"
+        )
+    ).
+
+%------------------------------------------------------------------------------%
+
+    % XXX How should fully qualified names be printed?
+    %
+:- func qname_doc(nsList, qName) = doc.
+
+qname_doc(_Namespaces, QName) = doc(QName ^ localName).
+
+%------------------------------------------------------------------------------%
+
+:- func attr_doc(nsList, nsAttribute) = doc.
+
+attr_doc(Namespaces, Attr) =
+    group(
+        line ++
+        qname_doc(Namespaces, Attr ^ aName) ++ "=" ++
+            indent(("\"" ++ Attr ^ aValue ++ "\"") `with_type` string)
+    ).
+
+%------------------------------------------------------------------------------%
+
+:- func datum_doc(nsContents, ref) = doc.
+
+datum_doc(Contents, Ref) = xml_doc(Contents, Contents ^ elem(Ref - 1)).
+
+%------------------------------------------------------------------------------%
+%------------------------------------------------------------------------------%

Here's a sample of the output:

<NEWSARTICLE>
  
  <ARTICLE DATE="10/05/00">
    <TITLE>Events and Reports</TITLE>
    <BODY>
      Two new sections have been added to the web site. The events section 
      will give information about events Mercury has been involved in -- we 
      start this section with a page on Mercury coming 4th (of 38 teams) in 
      the ICFP 2000 programming contest. The reports section contains regular 
      reports from the developers, such as minutes from Mercury meetings. Both 
      are available from the sidebar menu.
    </BODY>
  </ARTICLE>
  <ARTICLE DATE="10/04/00">
    <TITLE>Two new papers</TITLE>
    <BODY>
      Two new papers on Mercury are now available from our papers page. One 
      describes a binding-time analysis for higher order code, while the other 
      describes an analysis for detecting whether a memory cell is availble 
      for reuse.
    </BODY>
  </ARTICLE>
</NEWSARTICLE>

...compared with the original...

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE NEWSARTICLE SYSTEM "newsarticles.dtd">
<NEWSARTICLE>
<ARTICLE DATE="10/05/00">
    <TITLE>Events and Reports</TITLE>
    <BODY>Two new sections have been added to the web site. The events section will give information about events Mercury has been involved in -- we start this section with a page on Mercury coming 4th (of 38 teams) in the ICFP 2000 programming contest. The reports section contains regular reports from the developers, such as minutes from Mercury meetings. Both are available from the sidebar menu.</BODY>
</ARTICLE>
<ARTICLE DATE="10/04/00">
    <TITLE>Two new papers</TITLE>
    <BODY>Two new papers on Mercury are now available from our papers page. One describes a binding-time analysis for higher order code, while the other describes an analysis for detecting whether a memory cell is availble for reuse.</BODY>
</ARTICLE>
</NEWSARTICLE>
--------------------------------------------------------------------------
mercury-reviews mailing list
post:  mercury-reviews at cs.mu.oz.au
administrative address: owner-mercury-reviews at cs.mu.oz.au
unsubscribe: Address: mercury-reviews-request at cs.mu.oz.au Message: unsubscribe
subscribe:   Address: mercury-reviews-request at cs.mu.oz.au Message: subscribe
--------------------------------------------------------------------------
Previous message: [m-rev.] For review: xml pretty printer
Next message: [m-rev.] trivial diff: fix asm_fast.gc.par compilation error
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the reviews mailing list