[m-rev.] Re: for review: parsing_utils improvements

Ian MacLarty maclarty at csse.unimelb.edu.au
Wed Sep 30 12:12:14 AEST 2009


On Tue, Sep 29, 2009 at 5:25 PM, Ralph Becket <rafe at csse.unimelb.edu.au> wrote:
> Ian MacLarty, Tuesday, 29 September 2009:
>> On Tue, Sep 29, 2009 at 4:02 PM, Ralph Becket <rafe at csse.unimelb.edu.au> wrote:
>> > Ian MacLarty, Tuesday, 29 September 2009:
>> >> > Having said that, aren't grammars like this fairly rare?  I mean,
>> >> > can the mutable stuff be separated out just for the parsers that need
>> >> > it?
>> >>
>> >> I don't know. What did you have in mind?
>> >
>> > Well, my sparql parser might simply define its own mutable and implement
>> > its own fail_with_error predicate.  The position only needs to be
>> > recorded at the point where fail_with_error is called.  If the parser
>> > succeeds at a point further than that stored in the mutable, then the
>> > error therein can be ignored.
>> >
>>
>> Yes I suppose you could do that, but I think that puts too much burden
>> on the parser writer since they would have to write all the impure
>> code to manage the mutable.  The point of this change is to put that
>> behind an abstraction barrier so users don't shoot themselves in the
>> foot.
>>
>> Also with this change you don't actually have to call
>> fail_with_message at all to get a usable error report.  Even if you
>> never call fail_with_message you can still get the furthest offset the
>> parser got to before it failed, which is a lot more useful than no
>> information at all.
>> The idea is that users could write their parser in a naive style and
>> still end up with something that is usable.
>
> My main concern with this change is that I'm not convinced it's the
> right way to do things.  Sticking mutables in there always starts
> alarm bells ringing.  I may be wrong.
>

I don't see any issues with the way I've used mutables here.  The
promise_pures arounds the spots where the mutable is updated look
correct to me.  Yes these predicates have the side effect of updating
some unseen state, but you cannot look at that state from anywhere
else in your program, so they make no difference to the declarative
semantics of these predicates or the rest of the program.  The only
predicate that looks at the mutables is parse/3 and that has a well
defined declarative semantics.  It's sound w.r.t those semantics, but
not complete, so we need the cc_multi determinism.  Do you see any
other issues with the way I've used mutables here?

> Here's the thing: say I have a rule
>
> A ::= B | C | D
> B ::= B1 B2 B3
> ...
>
> If I'm parsing A and B1 succeeds, then, in virtually all grammars
> I've seen, if B2 B3 fails then I know this is an error and I know
> C and D shouldn't match (if they can then I have a fairly unpleasant
> grammar...).
>
> The question now is, if B2 B3 fails, what should I do?  My first
> thought is throw an exception.
> Exception handlers further up the
> call stack can try advancing the offset until the next sensible
> starting point if we want to search for more errors.
>

I'm not keen on the exceptions approach.  Here's why:

1) To get any errors at all you would have to explicitly add in error
handling code.  That's not the case with the mutables approach.  For
example, you could write a parser for B above as:

b(Src, [B1, B2, B3]) -->
    b1(Src, B1),
    b2(Src, B2),
    b3(Src, B3).

and *still* get a useful error message if it fails.  I've already
found this useful.  For example when I was writing some input to
benchmark the sparql parser I had some syntax errors that I hadn't
explicitly handled in the code.  However the parser was still able to
tell me exactly where my errors were.  If your input is large this is
essential.

2) I don't like the idea of using exceptions for control flow.
Exceptions are meant for situations that are exceptional and which you
don't expect to occur very often.  When you write a parser you
normally expect the input to contain some errors.

3) Using exceptions is problematic when you want to reuse parts from
one parser in another parser.  For example a lot of languages that
deal with RDF data use the following syntax for URIs:

    <http://mydomain.com/rdf/myrdffile.rdf#mynode>

so it makes sense to write a predicate to parse URIs that can be
shared between parsers.  Now if we were writing such a predicate using
the exceptions approach we would probably do it by looking at the
first character and failing if it was not '<'.  If the first character
was '<' we would assume it must be a URI and try to parse the rest of
it.  If we subsequently failed (for example if we encountered a
character that is not allowed in a URI) we would throw an exception.
The exception would contain the offset of the illegal character.  The
signature of our predicate would look something like the following:

   % This fails if the first character is not '<' and throws an
exception if the first character is '<' but there
   % is some other error in the URI.
:- pred parse_uri(src::in, string::out, ps::in, ps::out) is semidet.

This predicate is a liability because callers have to be aware that it
might throw an exception for input that doesn't violate any of its
preconditions.  If you're always going to use it in parsers that have
a try at their top level, then it's not such a problem, but there's no
way to ensure that it's always used that way.  Also the predicate has
to make an assumption about the languages that it's going to be used
in. It has to assume that any token that starts with a '<' must be
parsed as a URI.  It would be nice if it didn't have to make this
assumption.  For example if I wanted to make tokens of the form
<<token>> have a different meaning in my language.

If you implemented it using the mutables approach then the semantics
of parse_uri is much simpler and clearer: it fails when the input is
not a URI and succeeds when it is.  No assumptions about the language
that uses it have to be made.  And you don't have to worry about
handling any exceptions it might throw.  And even if it doesn't call
fail_with_message it will still be able to tell you the position of
illegal characters in invalid URIs.

> The other problem I have with this approach is that when I want to
> report an error, I usually want to provide more context.  For example,
> I'd like to say
>
>        In application of foldl
>        In argument 1
>        This is not a function.
>

Wouldn't you report this kind of error after parsing?  Does the
exception approach make this easier?

In any case you could still use the exception approach if you wanted to.

Here are some benchmarks I did with the sparql parser with an input of
4.8M (times are an average of 3 runs):

naive:       6.7s
2mutables:   4.8s
flag_on:     5.0s
flag_off:    4.1s
no_mutables: 3.7s

naive is my original diff.
2mutables has one mutable with only the offset which is updated often
and another mutable with an offset and an error message that is only
updated when fail_with_message is called.
flag_on uses a flag to decide whether to update the mutables or not
(the flag is on)
flag_off is the same as flag_on, but with the flag turned off (so no
mutables updated)
no_mutables is basically the current version (i.e. never update any mutables)

Ian.

--------------------------------------------------------------------------
mercury-reviews mailing list
Post messages to:       mercury-reviews at csse.unimelb.edu.au
Administrative Queries: owner-mercury-reviews at csse.unimelb.edu.au
Subscriptions:          mercury-reviews-request at csse.unimelb.edu.au
--------------------------------------------------------------------------



More information about the reviews mailing list