[m-users.] Determinism issue with parser

Fabrice Nicol fabrnicol at gmail.com
Mon Jul 5 05:17:22 AEST 2021


Hi,
The latest conversation between emacstheviking and Zoltan about how the
compiler handles determinism in disjuncts is quite illuminating and useful
to learners, I think. I went through all this about a year ago when I began
learning Mercury more seriously and I think all learners do or will
experience such situations.
I think it would be a Good Idea to add a separate practical example to J.
Fondren's excellent crash course to drive the point home.
Perhaps it would also help to strongly underline that the way Mercury works
in disjuncts is a design consequence of the elimination of prolog cuts, and
not just a form of compiler blindness. I now that prolog is not the core
interest of the Mercury dev team but practically people tackling Mercury
are coming from a prolog experience and the prolog to Mercury transition
guide is a bit short on practical issues in this respect.
Having read R. O'Keefe's The Craft of Prolog in the past does help
(Mercury's design is a direct response to prolog shortcomings reviewed in
this book), but not all learners read it and a few more Mercury examples
would not harm.
Fabrice


Le dim. 4 juil. 2021 à 7:34 PM, <users-request at lists.mercurylang.org> a
écrit :

> Send users mailing list submissions to
>         users at lists.mercurylang.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.mercurylang.org/listinfo/users
> or, via email, send a message with subject or body 'help' to
>         users-request at lists.mercurylang.org
>
> You can reach the person managing the list at
>         users-owner at lists.mercurylang.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>    1. Re: Determinism issue with parser (Sean Charles (emacstheviking))
>    2. Re: Determinism issue with parser (Zoltan Somogyi)
>    3. Re: Determinism issue with parser (Volker Wysk)
>    4. Re: Determinism issue with parser (Zoltan Somogyi)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 4 Jul 2021 17:31:59 +0100
> From: "Sean Charles (emacstheviking)" <objitsu at gmail.com>
> To: Zoltan Somogyi <zoltan.somogyi at runbox.com>
> Cc: users <users at lists.mercurylang.org>
> Subject: Re: [m-users.] Determinism issue with parser
> Message-ID: <F5C1D814-E1A5-46C5-BE48-E938C47B7C7E at gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Zoltan, your replies have always helped and continue to do so.
>
> If by "one" you mean one determinism declaration, they why do you
> "change one and recompile"?
>
> The answer; simple frustration at being fully cognisant of my own
> ignorance, that’s why.
>
> I —understand— the reasons why it is telling me I am wrong but then chaos
> seems to erupt inside my head. It’s —my— decision as to the determinism of
> the code I write, IIUIC that’s the point of declaring it, then the compiler
> tells me otherwise when it knows better (often).
>
> As for using solutions/1, it crossed my mind but for me that’s a biiig
> code smell and I refused to do it.
>
> Part of my confusion is understanding what it means to have more than one
> solution if the inputs are the same?!?! How can that be? For example, in
> this predicate where I check for a body form:
>
> :- pred parse_body_term(snode::out, astcon::in, astcon::out) is multi.
>
> parse_body_term(Term, !A) :-
>     (
>         parse_atom(Term, !A)
>     ;
>         parse_list(Term, !A)
>     ;
>         parse_map(Term, !A)
>     ;
>         parse_feltcall(Term, !A)
>     ;
>         parse_fncall(Term, !A)
>     ;
>         parse_sexp(Term, !A)
>     ).
>
> I thought that this was semidet because either one of those disjuncts
> succeed or they all fail, I don’t understand why it says multi, multi is
> ‘can’t fail, >1 solutions’ but if I consume tokens in the parse_map()
> disjunct branch, how can it then come back for more? I just don’t get it,
> that’s something that really confuses me at the moment. Hence my question
> ‘what does it mean to have more than one solution?’.
>
> det: for example ‘*’, given x*y, the result is deterministic.
> semidet: for example, given x/y, if y is nonzero the result is
> deterministic, if y is zero presumably it fails. pedagogical usage here
> that’s all.
>
> for me the confusion starts with `multi`, to have more than one solution
> surely it has to be called more than once with different inputs or is
> stepping through a set of facts?
>
> I think if I had managed to get mdb to work it might help to single step
> the code but I tried for a few hours and emacs version was wrong, lots of
> problems etc so I went back to ’thinking’ about it and trace[] all over the
> place.
>
> Not a good place to be at the moment but things worthwhile are never easy
> to attain.
>
> Thanks.
>
>
> > On 4 Jul 2021, at 16:56, Zoltan Somogyi <zoltan.somogyi at runbox.com>
> wrote:
> >
> >
> > 2021-07-05 01:21 GMT+10:00 "Volker Wysk" <post at volker-wysk.de>:
> >> I've built a big parser for my project, too. It should always produce
> one
> >> solution; it can't fail. But the compiler can't infer that it is
> >> deterministic, for many parts of it. (That would be undecidable.)
> >>
> >> So I've resigned to this state of affairs. I just collect all the
> (exactly
> >> one) solutions with solutions/2 and produce an error message in case
> there
> >> should be more than one, which would be a bug in my parser.
> >
> > That works, but I would consider it fragile. The reason is that
> convincing
> > someone (the next maintainer of the parser, perhaps) that there will be
> > exactly one overall solution requires a single correctness argument over
> > the entire code, and humans are not very reliable as reasoning machines;
> > the longer the argument, the higher the probability of error.
> >
> > I would instead add disambiguation code to the places in the parser
> > than can introduce (as far as the compiler is concerned) the possibility
> > of nondeterminism. You can you can see why there is no such possibility;
> > I would try  to make the compiler that too. In most cases, this is not
> too hard,
> > at least in my experience.
> >
> > This approach replaces a single bug, global correctness argument with
> lots of
> > small, local correctness arguments, which can be understood, documented,
> > and if need be fixed, much more easily. Modularity in correctness
> arguments,
> > if you will.
> >
> > Zoltan.
>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <
> http://lists.mercurylang.org/archives/users/attachments/20210704/043b2328/attachment-0001.html
> >
>
> ------------------------------
>
> Message: 2
> Date: Mon, 05 Jul 2021 03:09:42 +1000 (AEST)
> From: "Zoltan Somogyi" <zoltan.somogyi at runbox.com>
> To: "Sean Charles" <objitsu at gmail.com>
> Cc: users <users at lists.mercurylang.org>
> Subject: Re: [m-users.] Determinism issue with parser
> Message-ID: <E1m05ck-0002LB-Kd at rmmprod06.runbox>
> Content-Type: text/plain; charset="utf-8"
>
>
> 2021-07-05 02:31 GMT+10:00 "Sean Charles (emacstheviking)" <
> objitsu at gmail.com>:
> > Part of my confusion is understanding what it means to have more than
> one solution if the inputs are the same?!?!
>
> I am not sure what you are trying to say with "if the inputs are the same".
> Determinism is about the number of solutions you can get *for a given set
> of input values*.
>
> > How can that be? For example, in this predicate where I check for a body
> form:
> >
> > :- pred parse_body_term(snode::out, astcon::in, astcon::out) is multi.
> >
> > parse_body_term(Term, !A) :-
> >     (
> >         parse_atom(Term, !A)
> >     ;
> >         parse_list(Term, !A)
> >     ;
> >         parse_map(Term, !A)
> >     ;
> >         parse_feltcall(Term, !A)
> >     ;
> >         parse_fncall(Term, !A)
> >     ;
> >         parse_sexp(Term, !A)
> >     ).
> >
> > I thought that this was semidet because either one of those disjuncts
> succeed or they all fail,
>
> You may know that, but the compiler does not. As I mentioned in my
> previous email,
> the compiler does not even *try* to figure out whether the success of one
> of these calls
> rules out the success of the other calls. Since I presume none of the
> called predicates
> are declared to have determinism "failure" or "erroneous", it simply
> assumes that
> they can all succeed. As far as the compiler is concerned, any given call
> to parse_body_term
> can generate six solutions, one from each disjunct, even if each call
> generates a maximum of one solution.
>
> > I don’t understand why it says multi, multi is ‘can’t fail, >1
> solutions’ but if I consume tokens
> > in the parse_map() disjunct branch, how can it then come back for more?
>
> There are two answers to that.
>
> First, a call to parse_x, whatever x is, may consume tokens in the sense
> of stepping over them,
> but this has effect only during forward execution. Upon backtracking, the
> value of !.A that the
> next disjunct will start with *will* be the same as its original value on
> entry to the disjunction,
> and thus will contain the "consumed" tokens as well as all the following
> tokens.
>
> Second, determinism analysis does not know anything about tokens, consumed
> or not,
> and would not care if it did. The predicates called in the later disjuncts
> have declared modes
> that say they can succeed, so *regardless of anything else*, determinism
> analysis assumes
> they can succeed. That's it. Determinism analysis of disjunctions is
> *that* simple.
>
> > I just don’t get it, that’s something that really confuses me at the
> moment. Hence my question ‘what does it mean to have more than one
> solution?’.
>
> Section 6.2 gives the *complete* algorithm.
>
> Your problem is that you attributing to determinism analysis
> powers and reasoning skills it simply does not have.
>
> In the case of the above disjunction, if there is a later disjunct than
> e.g. the one that calls parse_atom,
> then the compiler will assume it *can* succeed and generate a solution.
> That's it; there is
> nothing more. Determinism analysis does not know, nor does it care,
> whether *you* know,
> or at least believe, that no later disjunct would generate a solution.
>
> > det: for example ‘*’, given x*y, the result is deterministic.
> > semidet: for example, given x/y, if y is nonzero the result is
> deterministic, if y is zero presumably it fails. pedagogical usage here
> that’s all.
>
> I have no idea what you are trying to say here. In any case, integer
> division is det, though it throws
> an exception if the divisor is zero.
>
> > for me the confusion starts with `multi`, to have more than one solution
> surely it has to be called more than once with different inputs
>
> No, that is wrong. Determinism is about how many solutions you can get for
> *one* call.
> Given a predicate p(in, out) whose definition is
>
> p(1, 11).
> p(1, 12).
>
> the one call p(1, X) will generate two solutions, X = 11 and X = 12.
>
> > or is stepping through a set of facts?
>
> Determinism analysis does not know, or care, whether a solution comes from
> a fact or not.
>
> > I think if I had managed to get mdb to work
>
> You can't run mdb until your program has been compiled, and it won't
> compile until
> you fix all the determinism errors.
>
> I see two ways to fix your code above. The easier but worse approach would
> to write
> something like
>
> parse_body_term(Term, !A) :-
>      ( if  parse_atom(TermPrime, !A) then
>           Term = TermPrime
>     else if parse_list(TermPrime, !A) then
>           Term = TermPrime
>     else if parse_map(Term, !A) then
>     ...
>     else
>           parse_sexp(Term, !A)
>    ).
>
> This will tell the compiler that if you can find e.g. a list in the input,
> then you want to keep the list, and not look for any more solutions.
>
> The harder but better solution is available only if the sets of tokens
> that atoms, lists, maps etc can start with is disjoint. If that is the
> case,
> then I would write parse_body_term as (a) get the first token, then
> (b) switch on that token.
>
> The second approach is always available in LL(1) grammars, but
> of course not all grammars are LL(1).
>
> Zoltan.
>
> ------------------------------
>
> Message: 3
> Date: Sun, 04 Jul 2021 19:21:11 +0200
> From: Volker Wysk <post at volker-wysk.de>
> To: zoltan.somogyi at runbox.com, Sean Charles <objitsu at gmail.com>, users
>         <users at lists.mercurylang.org>
> Subject: Re: [m-users.] Determinism issue with parser
> Message-ID:
>         <ecfa13776f3097af6bfcb6e85d2a4db79a5a1488.camel at volker-wysk.de>
> Content-Type: text/plain; charset="utf-8"
>
> Am Montag, den 05.07.2021, 01:56 +1000 schrieb Zoltan Somogyi:
> > 2021-07-05 01:21 GMT+10:00 "Volker Wysk" <post at volker-wysk.de>:
> > > I've built a big parser for my project, too. It should always produce
> one
> > > solution; it can't fail. But the compiler can't infer that it is
> > > deterministic, for many parts of it. (That would be undecidable.)
> > >
> > > So I've resigned to this state of affairs. I just collect all the
> (exactly
> > > one) solutions with solutions/2 and produce an error message in case
> there
> > > should be more than one, which would be a bug in my parser.
> >
> > That works, but I would consider it fragile. The reason is that
> convincing
> > someone (the next maintainer of the parser, perhaps) that there will be
> > exactly one overall solution requires a single correctness argument over
> > the entire code, and humans are not very reliable as reasoning machines;
> > the longer the argument, the higher the probability of error.
>
> The parser has fallbacks built in. A paragraph that isn't something else,
> is
> a text paragraph. An inline (something which occurs inside a paragraph)
> that
> isn't something special, is a simple word. A bug, which would cause a
> special case not to be recognized, causes the respective text to end up in
> the fallback data.
>
> The special cases are all in if-then-else-if-then-else etc. constructs. So
> there can be only be one matching. It's a parser for the markup language of
> a wiki engine. I think this is robust because of the language being parsed
> is like that.
>
> But I'd love to have the compiler understand the fact that it's
> deterministic. :-)
>
> > I would instead add disambiguation code to the places in the parser
> > than can introduce (as far as the compiler is concerned) the possibility
> > of nondeterminism. You can you can see why there is no such possibility;
> > I would try  to make the compiler that too. In most cases, this is not
> too hard,
> > at least in my experience.
>
> Hmm... I use my meta-predicate "repeatmax0" (and some variations of it) a
> lot. But see further down first, you don't really need all the details:
>
>
> % Something repeated zero to several times. Return the collected return
> % values as a list.
>
> :- pred repeat0(pred(T, list(char), list(char)), list(T),
>                 list(char), list(char)).
> :- mode repeat0(pred(out, in, out) is det,     out, in, out) is multi.
> :- mode repeat0(pred(out, in, out) is semidet, out, in, out) is multi.
> :- mode repeat0(pred(out, in, out) is multi,   out, in, out) is multi.
> :- mode repeat0(pred(out, in, out) is nondet,  out, in, out) is multi.
>
>
> % The longest sequence of repetitions of the specified thing, which returns
> % something. Its return values are returned as a list. Succeeds for zero
> % repetitions (with an empty list).
>
> :- pred repeatmax0(pred(T, list(char), list(char)), list(T),
>                    list(char), list(char)).
> :- mode repeatmax0(pred(out, in, out) is det,     out, in, out) is nondet.
> :- mode repeatmax0(pred(out, in, out) is semidet, out, in, out) is nondet.
> :- mode repeatmax0(pred(out, in, out) is multi,   out, in, out) is nondet.
> :- mode repeatmax0(pred(out, in, out) is nondet,  out, in, out) is nondet.
>
>
> The compiler told me that repeatmax0 is nondet. The implementation is
> simple:
>
>
> repeat0(_, []) --> [].
>
> repeat0(Pars, Values) -->
>     { Values = [Val|Values1] },
>     Pars(Val),
>     repeat0(Pars, Values1).
>
> repeatmax0(Pars, Values) -->
>     repeat0(Pars, Values),   % XXX
>     ( if   not Pars(_)
>       then []
>       else end
>     ).
>
>
> And there's the parser for the end of the input:
>
>
> :- pred end(list(char)::in, list(char)::out) is semidet.
>
> end([], []).
>
>
> The compiler told me, that the call to repeat0 (marked with "% XXX" above)
> can succeed more than once, which is correct.
>
> Taking your hint about disambiguation code, I was able to reformulate the
> repeatmax0 predicate such that, repeatmax0 becomes deterministic for a
> deterministic input parser:
>
>
> :- mode repeatmax0(pred(out, in, out) is det, out, in, out) is det.
>
> repeatmax0(Pars, Values) -->
>     ( if   Pars(Val)
>       then { Values = [Val|Vals] },
>            repeatmax0(Pars, Vals)
>       else { Values = [] }
>     ).
>
>
> I'm still not sure what you mean by "disambiguation code", though.
>
>
> > This approach replaces a single bug, global correctness argument with
> lots of
> > small, local correctness arguments, which can be understood, documented,
> > and if need be fixed, much more easily. Modularity in correctness
> arguments,
> > if you will.
>
> Thanks!
>
> Volker
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: signature.asc
> Type: application/pgp-signature
> Size: 833 bytes
> Desc: This is a digitally signed message part
> URL: <
> http://lists.mercurylang.org/archives/users/attachments/20210704/672d4de7/attachment-0001.sig
> >
>
> ------------------------------
>
> Message: 4
> Date: Mon, 05 Jul 2021 03:34:16 +1000 (AEST)
> From: "Zoltan Somogyi" <zoltan.somogyi at runbox.com>
> To: "Volker Wysk" <post at volker-wysk.de>, "Sean Charles"
>         <objitsu at gmail.com>, "users" <users at lists.mercurylang.org>
> Subject: Re: [m-users.] Determinism issue with parser
> Message-ID: <E1m060W-0004Bs-VW at rmmprod06.runbox>
> Content-Type: text/plain; charset="utf-8"
>
>
> 2021-07-05 03:21 GMT+10:00 "Volker Wysk" <post at volker-wysk.de>:
> > The compiler told me that repeatmax0 is nondet. The implementation is
> > simple:
> >
> >
> > repeat0(_, []) --> [].
> >
> > repeat0(Pars, Values) -->
> >     { Values = [Val|Values1] },
> >     Pars(Val),
> >     repeat0(Pars, Values1).
> >
> > repeatmax0(Pars, Values) -->
> >     repeat0(Pars, Values),   % XXX
> >     ( if   not Pars(_)
> >       then []
> >       else end
> >     ).
>
> This is nondet because the compiler can't see any reason
> why the second and third clauses could not both generate
> a solution.
>
> > Taking your hint about disambiguation code, I was able to reformulate the
> > repeatmax0 predicate such that, repeatmax0 becomes deterministic for a
> > deterministic input parser:
> >
> >
> > :- mode repeatmax0(pred(out, in, out) is det, out, in, out) is det.
> >
> > repeatmax0(Pars, Values) -->
> >     ( if   Pars(Val)
> >       then { Values = [Val|Vals] },
> >            repeatmax0(Pars, Vals)
> >       else { Values = [] }
> >     ).
> >
> >
> > I'm still not sure what you mean by "disambiguation code", though.
>
> I meant any code that tells the compiler that there will be only one
> solution
> even though its old version would be able to generate more than one
> solution
> as far as the compiler is concerned.
>
> Your code immediately above does this: by replacing the original second
> and third clauses (an implicit disjunction) with an if-then-else, you told
> the compiler
> that there will be a solution from either then then-part or the else-part,
> BUT NOT BOTH.
> So it seems you understood my hint sufficiently well, even if you did not
> think so :-)
>
> Zoltan.
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> users mailing list
> users at lists.mercurylang.org
> https://lists.mercurylang.org/listinfo/users
>
>
> ------------------------------
>
> End of users Digest, Vol 83, Issue 2
> ************************************
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mercurylang.org/archives/users/attachments/20210704/89af0826/attachment-0001.html>


More information about the users mailing list