[m-rev.] for review: better diagnostics for unbalanced parentheses

Peter Wang novalazy at gmail.com
Tue Jul 29 16:55:25 AEST 2025


On Mon, 28 Jul 2025 20:12:56 +0200 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> For review by anyone.
> 
> Zoltan.

> Generate better diagnostics for parentheses mismatches.
> 
> When you have an unclosed (, [ or { in a clause, the diagnostic
> you got did not tell you
> 
> - where the unclosed parenthesis was, 
> - which kind of parenthesis it was.
> 
> Fix this by including both pieces of information in the diagnostic.
> 
> Likewise, print more useful info for mixed-up parentheses,
> such as [(]).
> 
> library/mercury_term_parser.m:
>     When consuming a (, [ or { token, push it and its context on a stack.
>     When consuming a ), ] or } token, pop off the top item from this stack,
>     and generate a diagnostic if the close token does not match it.
>     The one exception from this pushing and pulling is for code that
>     handles the case where the open is followed immediately by
>     the matching close, such as when parsing [] or {}.
> 
>     Print the contents of the stack also when getting to either
>     the end of the term, or the end of the input, with a nonempty stack.
> 
>     Maintaining this stack has a small performance cost, but I expect
>     it to be negligible, especially compared to the usefulness
>     of the new detail in diagnostics,
> 
>     Completely rework the error handling parts of this module.
>     The main changes are the following.
>     
>     First, the old code used to include *part* of the intended message
>     in the pr_error structures it created, with a "Syntax error: "
>     prefix being added later. Since this makes it hard to ensure
>     that the error messages follow the rules of English, change this
>     to generate each error message all at once.
> 
>     Second, the old code included the list of the remaining tokens
>     in each pr_error structure. This was overkill, because the only part
>     of this list that was used was the id and the context of the
>     first token in the list. Apart from being inelegant, the main flaw
>     of this approach was that in the case of premature end-of-file
>     errors, the only token list available was token_nil, which
>     of course contains neither a token nor its context. The old code
>     compensated for it later by using the context of the *first* token
>     of the whole term being parsed, which is ... less than useful.
>     
>     The new code replaces the token list with the context, if it
>     is available; if it is not, then later we compute the context
>     of the last token in the whole token list. The new code
>     does not return the token itself; instead, it includes
>     its string vercion in the generated error message where appropriate.

s/vercion/version/

> 
>     Third, we now consistently use "expected abc, got xyz" phraseology
>     where possible.

I prefer how the old messages stated the offending token at the start of
the message; I think that's the most important part. Telling me what the
parser was _expecting_, in particular "expected start of (sub)term",
is not so useful.
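
For what it's worth, the push/pop scheme the log message describes
(push each open bracket with its context; on each close, pop and
compare) can be sketched in a few lines. This is only an illustrative
Python sketch with invented names, not the Mercury code from the patch;
in particular, popping on a mismatch is just one possible recovery
choice:

```python
# Sketch of the bracket-tracking stack described in the log message.
# All names here are invented for illustration.

OPENER_FOR = {')': '(', ']': '[', '}': '{'}

def check_brackets(tokens):
    """tokens: list of (token, line) pairs. Returns a list of diagnostics."""
    stack = []    # each entry: (open_token, line) -- the "nest" stack
    errors = []
    for tok, line in tokens:
        if tok in '([{':
            stack.append((tok, line))          # push open token + context
        elif tok in ')]}':
            if not stack:
                errors.append(
                    f"line {line}: there is no open '{OPENER_FOR[tok]}'"
                    f" to close here")
            else:
                open_tok, open_line = stack.pop()
                if open_tok != OPENER_FOR[tok]:
                    # One recovery choice: pop anyway and report the mismatch.
                    errors.append(
                        f"line {line}: got '{tok}', but the most recent "
                        f"unclosed bracket is '{open_tok}' on line {open_line}")
    # At end of input, report anything still open, innermost first.
    for open_tok, open_line in reversed(stack):
        errors.append(f"unclosed '{open_tok}' on line {open_line}")
    return errors
```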

> diff --git a/library/mercury_term_lexer.m b/library/mercury_term_lexer.m
> index 52e31f507..7912b6791 100644
> --- a/library/mercury_term_lexer.m
> +++ b/library/mercury_term_lexer.m

> @@ -216,6 +287,41 @@
>  % line context/position pairs. However, while we have optimizations
>  % that eliminate the overhead of generic calls in *most* cases, we cannot
>  % (yet) ensure the elimination of this overhead in *all* cases.
> +%
> +% A note about the use of subtypes: we represent tokens using three types,
> +% namely raw_token, non_id_token and token itself. The differences are
> +%
> +% - raw_token, the supertype, defines all tokens that the lexer deals with,
> +%   including the internal-use-only integer_dot and eof;
> +%
> +% - the non_id_token type, which is a subtype of raw_token that leaves out
> +%   integer_dot (the "id" in the name is short for "integer_dot"), and
> +%
> +% - the token type, which is a subtype of raw_token that leaves out both
> +%   integer_dot and eof.
> +%
> +% The get_* predicates use the raw_token type; that type was designed for
> +% their needs (though at that time, that type was named "token").
> +%
> +% The string_get_* and linestr_get_* predicates use the non_id__token type.

non_id_token

> +% They don't need to use the integer_dot token because pushback is
> +% not an issue for them.
> +%
> +% The final token list uses the token type, because it represents the end
> +% of the file as token_nil, i.e. the end fof the list, not as a token
> +% of any kind.

s/fof/of/

> +%
> +% Most auxiliary predicates that construct tokens mostly return them
> +% as values of the non_id__token type, even though they could return theem

non_id_token, them

> diff --git a/library/mercury_term_parser.m b/library/mercury_term_parser.m
> index c89b2dbb0..8acbae2f8 100644
> --- a/library/mercury_term_parser.m
> +++ b/library/mercury_term_parser.m

> @@ -939,27 +1004,33 @@ parse_args(List, !TokensLeft, !PS) :-
>                  ),
>                  (
>                      Tail0 = pr_ok(Tail),
> -                    List = pr_ok([Arg|Tail])
> +                    List = pr_ok([Arg | Tail])
>                  ;
> -                    Tail0 = pr_error(_, _),
> +                    Tail0 = pr_error(_),
>                      % Propagate error upwards.
>                      List = Tail0
>                  )
>              else if Token = close then
> -                List = pr_ok([Arg])
> +                pop_nest_open(close, Context, MaybeErrorMsg, !PS),
> +                (
> +                    MaybeErrorMsg = no,
> +                    List = pr_ok([Arg])
> +                ;
> +                    MaybeErrorMsg = yes(ErrorMsg),
> +                    List = pr_error(pr_error_ctxt(Context, ErrorMsg))
> +                )
>              else
> -                parser_unexpected_tok(Token, Context,
> -                    "expected `,', `)', or operator", List, !TokensLeft, !.PS)
> +                report_unexpected_token(Token, Context,
> +                    expected("`,', `)', or operator"), List, !TokensLeft, !.PS)
>              )
>          ;
>              !.TokensLeft = token_nil,
> -            List = pr_error("unexpected end-of-file in argument list",
> -                !.TokensLeft)
> +            report_unexpected_eof(expected("a comma or a ')'"), List, !.PS)

Open with a backtick: `)'

> @@ -1031,65 +1103,111 @@ parse_list_tail(Arg, List, !TokensLeft, !PS) :-
>              parse_arg(Tail0, !TokensLeft, !PS),
>              (
>                  Tail0 = pr_ok(Tail),
> -                ( if
> -                    !.TokensLeft = token_cons(close_list, _Context,
> -                        !:TokensLeft)
> -                then
> -                    List = pr_ok(term.functor(term.atom("[|]"), [Arg, Tail],
> -                        TermContext))
> -                else
> -                    parser_unexpected("expecting ']' or operator", List,
> -                        !TokensLeft, !.PS)
> +                (
> +                    !.TokensLeft = token_cons(NextToken, NextContext,
> +                        !:TokensLeft),
> +                    ( if NextToken = close_list then
> +                        pop_nest_open(close_list, Context, MaybeErrorMsg, !PS),
> +                        (
> +                            MaybeErrorMsg = no,
> +                            Term = term.functor(term.atom("[|]"), [Arg, Tail],
> +                                TermContext),
> +                            List = pr_ok(Term)
> +                        ;
> +                            MaybeErrorMsg = yes(ErrorMsg),
> +                            List = pr_error(pr_error_ctxt(Context, ErrorMsg))
> +                        )
> +                    else
> +                        report_unexpected_token(NextToken, NextContext,
> +                            expected("']' or operator"), List,
> +                            !TokensLeft, !.PS)

Open with a backtick: `]'

> +                    )
> +                ;
> +                    !.TokensLeft = token_nil,
> +                    report_unexpected_eof(expected("']' or operator"),
> +                        List, !.PS)
>                  )

Open with a backtick: `]'

>          else
> -            parser_unexpected_tok(Token, Context,
> -                "expected comma, `|', `]', or operator",
> +            report_unexpected_token(Token, Context,
> +                expected("comma, `|', `]', or operator"),
>                  List, !TokensLeft, !.PS)
>          )
>      ;
>          !.TokensLeft = token_nil,
>          % XXX The error message should state the line that the list started on.
> -        List = pr_error("unexpected end-of-file in list", !.TokensLeft)
> +        % ZZZ
> +        report_unexpected_eof(expected("',', '|' or ']'"), List, !.PS)
>      ).

ZZZ here.

Use backticks.

> diff --git a/tests/invalid_nodepend/unbalanced.err_exp b/tests/invalid_nodepend/unbalanced.err_exp
> index e69de29bb..f04ac815a 100644
> --- a/tests/invalid_nodepend/unbalanced.err_exp
> +++ b/tests/invalid_nodepend/unbalanced.err_exp
> @@ -0,0 +1,20 @@
> +unbalanced.m:013: Error: predicate `p'/2 has no clauses.
> +unbalanced.m:029: Syntax error: expected `)' or operator, got token `. ', with
> +unbalanced.m:029:   , and preceded by open '(' on line 21.

This message is faulty.

> +unbalanced.m:031: Error: predicate `q'/2 has no clauses.
> +unbalanced.m:038: Syntax error: expected comma, `|', `]', or operator, got
> +unbalanced.m:038:   token `)' (there is an open '[' on line 37 between the open
> +unbalanced.m:038:   '(' on line 36 and the ')' here).

I find all the punctuation, symbols, and line numbers hard to follow
and distracting.

Here is my suggestion:

    Syntax error at token ')': expected comma, `|', `]', or operator.
    There is an unclosed `[' on line 37.

I think it is enough to point out the location and identity of
the last unclosed bracket.
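
That wording would fall naturally out of the stack: only the top entry
needs reporting. A hypothetical sketch (Python, invented names, not the
patch's Mercury code):

```python
def last_unclosed_hint(stack):
    """stack: list of (open_token, line) pairs, outermost first.
    Report only the innermost unclosed bracket, rather than the
    whole chain of opens."""
    if not stack:
        return ""
    tok, line = stack[-1]
    return f"There is an unclosed `{tok}' on line {line}."
```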

> +unbalanced.m:048: Error: predicate `r'/2 has no clauses.
> +unbalanced.m:051: Syntax error: expected comma, `|', `]', or operator, got
> +unbalanced.m:051:   token `}' (there is an open '[' on line 51 between the open
> +unbalanced.m:051:   '{' on line 51 and the '}' here).
> +unbalanced.m:053: Error: predicate `s'/2 has no clauses.
> +unbalanced.m:056: Syntax error: expected start of (sub)term, got token `]'
> +unbalanced.m:056:   (there is no open '[' to close here).

My suggestion:

    Syntax error at token `]': expected start of (sub)term.
    There is no open `[' to close here.

> +unbalanced.m:058: Error: predicate `t'/2 has no clauses.
> +unbalanced.m:061: Syntax error: expected comma, `|', `]', or operator, got
> +unbalanced.m:061:   token `. ', with open '[' on line 61, and preceded by open
> +unbalanced.m:061:   '(' on line 61.
> +unbalanced.m:064: Syntax error: expected ',', '|' or ']', got end-of-file, with
> +unbalanced.m:064:   open '[' on line 64, and preceded by open '(' on line 63.

My suggestion:

    Syntax error at end-of-file: expected `,', `|' or `]'.
    There is an unclosed `[' on line 64.

Peter

