[m-rev.] for review: better diagnostics for unbalanced parentheses
Peter Wang
novalazy at gmail.com
Tue Jul 29 16:55:25 AEST 2025
On Mon, 28 Jul 2025 20:12:56 +0200 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> For review by anyone.
>
> Zoltan.
> Generate better diagnostics for parentheses mismatches.
>
> When you have an unclosed (, [ or { in a clause, the diagnostic
> you got did not tell you
>
> - where the unclosed parenthesis was,
> - which kind of parenthesis it was.
>
> Fix this by including both pieces of information in the diagnostic.
>
> Likewise, print more useful info for mixed-up parentheses,
> such as [(]).
>
> library/mercury_term_parser.m:
> When consuming a (, [ or { token, push it and its context on a stack.
> When consuming a ), ] or } token, pop off the top item from this stack,
> and generate a diagnostic if the close token does not match it.
> The one exception from this pushing and pulling is for code that
> handles the case where the open is followed immediately by
> the matching close, such as when parsing [] or {}.
>
> Print the contents of the stack also when getting to either
> the end of the term, or the end of the input, with a nonempty stack.
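As an aside for anyone following along: the push/pop scheme described above
is standard stack-based bracket matching. A minimal sketch of the technique
in Python (illustrative only; all names here are invented, and this is not
the actual Mercury code):

```python
# Map each close bracket to the open bracket it must match.
OPEN_FOR = {")": "(", "]": "[", "}": "{"}

def check_brackets(tokens):
    """tokens is a list of (token, line) pairs; returns a list of
    diagnostic strings: one per mismatched close, plus one per bracket
    still open at end of input."""
    stack = []      # (open_token, line) pairs, innermost last
    errors = []
    for tok, line in tokens:
        if tok in "([{":
            # Consuming an open bracket: push it and its context.
            stack.append((tok, line))
        elif tok in ")]}":
            # Consuming a close bracket: pop and check the top item.
            if not stack:
                errors.append(f"line {line}: there is no open '{OPEN_FOR[tok]}' to close here")
            else:
                open_tok, open_line = stack.pop()
                if open_tok != OPEN_FOR[tok]:
                    errors.append(f"line {line}: '{tok}' does not match the open '{open_tok}' on line {open_line}")
    # Anything still on the stack at the end of the input is unclosed.
    for open_tok, open_line in stack:
        errors.append(f"unclosed '{open_tok}' opened on line {open_line}")
    return errors
```

The stack's top entry always identifies both where the innermost unclosed
bracket is and which kind it is, which is exactly the extra detail the new
diagnostics report.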
>
> Maintaining this stack has a small performance cost, but I expect
> it to be negligible, especially compared to the usefulness
> of the new detail in diagnostics.
>
> Completely rework the error handling parts of this module.
> The main changes are the following.
>
> First, the old code used to include *part* of the intended message
> in the pr_error structures it created, with a "Syntax error: "
> prefix being added later. Since this makes it hard to ensure
> that the error messages follow the rules of English, change this
> to generate each error message all at once.
>
> Second, the old code included the list of the remaining tokens
> in each pr_error structure. This was overkill, because the only part
> of this list that was used was the id and the context of the
> first token in the list. Apart from being inelegant, the main flaw
> of this approach was that in the case of premature end-of-file
> errors, the only token list available was token_nil, which
> of course contains neither a token nor its context. The old code
> compensated for it later by using the context of the *first* token
> of the whole term being parsed, which is ... less than useful.
>
> The new code replaces the token list with the context, if it
> is available; if it is not, then later we compute the context
> of the last token in the whole token list. The new code
> does not return the token itself; instead, it includes
> its string vercion in the generated error message where appropriate.
s/vercion/version/
>
> Third, we now consistently use "expected abc, got xyz" phraseology
> where possible.
I prefer how the old messages stated the offending token at the start of
the message; I think that is the most important part. Telling me what the
parser was _expecting_, in particular "expected start of (sub)term",
is not so useful.
> diff --git a/library/mercury_term_lexer.m b/library/mercury_term_lexer.m
> index 52e31f507..7912b6791 100644
> --- a/library/mercury_term_lexer.m
> +++ b/library/mercury_term_lexer.m
> @@ -216,6 +287,41 @@
> % line context/position pairs. However, while we have optimizations
> % that eliminate the overhead of generic calls in *most* cases, we cannot
> % (yet) ensure the elimination of this overhead in *all* cases.
> +%
> +% A note about the use of subtypes: we represent tokens using three types,
> +% namely raw_token, non_id_token and token itself. The differences are
> +%
> +% - raw_token, the supertype, defines all tokens that the lexer deals with,
> +% including the internal-use-only integer_dot and eof;
> +%
> +% - the non_id_token type, which is a subtype of raw_token that leaves out
> +% integer_dot (the "id" in the name is short for "integer_dot"), and
> +%
> +% - the token type, which is a subtype of raw_token that leaves out both
> +% integer_dot and eof.
> +%
> +% The get_* predicates use the raw_token type; that type was designed for
> +% their needs (though at that time, that type was named "token").
> +%
> +% The string_get_* and linestr_get_* predicates use the non_id__token type.
non_id_token
> +% They don't need to use the integer_dot token because pushback is
> +% not an issue for them.
> +%
> +% The final token list uses the token type, because it represents the end
> +% of the file as token_nil, i.e. the end fof the list, not as a token
> +% of any kind.
s/fof/of/
> +%
> +% Most auxiliary predicates that construct tokens mostly return them
> +% as values of the non_id__token type, even though they could return theem
non_id_token, them
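To make the three-type relationship in that comment concrete, here is a toy
model of it as Python sets (the three type names follow the comment above;
the member names are invented, and Mercury of course expresses this with
subtypes rather than sets):

```python
# raw_token: everything the lexer deals with, including internal-use-only
# integer_dot and eof. Member names other than integer_dot and eof are
# invented placeholders for this sketch.
RAW_TOKEN = {"name", "variable", "integer", "string",
             "open", "close", "integer_dot", "eof"}

# non_id_token: raw_token minus integer_dot; used by the string_get_* and
# linestr_get_* predicates, for which pushback is not an issue.
NON_ID_TOKEN = RAW_TOKEN - {"integer_dot"}

# token: raw_token minus both integer_dot and eof; used in the final token
# list, where end of file is represented by token_nil, not by a token.
TOKEN = RAW_TOKEN - {"integer_dot", "eof"}
```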
> diff --git a/library/mercury_term_parser.m b/library/mercury_term_parser.m
> index c89b2dbb0..8acbae2f8 100644
> --- a/library/mercury_term_parser.m
> +++ b/library/mercury_term_parser.m
> @@ -939,27 +1004,33 @@ parse_args(List, !TokensLeft, !PS) :-
> ),
> (
> Tail0 = pr_ok(Tail),
> - List = pr_ok([Arg|Tail])
> + List = pr_ok([Arg | Tail])
> ;
> - Tail0 = pr_error(_, _),
> + Tail0 = pr_error(_),
> % Propagate error upwards.
> List = Tail0
> )
> else if Token = close then
> - List = pr_ok([Arg])
> + pop_nest_open(close, Context, MaybeErrorMsg, !PS),
> + (
> + MaybeErrorMsg = no,
> + List = pr_ok([Arg])
> + ;
> + MaybeErrorMsg = yes(ErrorMsg),
> + List = pr_error(pr_error_ctxt(Context, ErrorMsg))
> + )
> else
> - parser_unexpected_tok(Token, Context,
> - "expected `,', `)', or operator", List, !TokensLeft, !.PS)
> + report_unexpected_token(Token, Context,
> + expected("`,', `)', or operator"), List, !TokensLeft, !.PS)
> )
> ;
> !.TokensLeft = token_nil,
> - List = pr_error("unexpected end-of-file in argument list",
> - !.TokensLeft)
> + report_unexpected_eof(expected("a comma or a ')'"), List, !.PS)
Open with a backtick: `)'
> @@ -1031,65 +1103,111 @@ parse_list_tail(Arg, List, !TokensLeft, !PS) :-
> parse_arg(Tail0, !TokensLeft, !PS),
> (
> Tail0 = pr_ok(Tail),
> - ( if
> - !.TokensLeft = token_cons(close_list, _Context,
> - !:TokensLeft)
> - then
> - List = pr_ok(term.functor(term.atom("[|]"), [Arg, Tail],
> - TermContext))
> - else
> - parser_unexpected("expecting ']' or operator", List,
> - !TokensLeft, !.PS)
> + (
> + !.TokensLeft = token_cons(NextToken, NextContext,
> + !:TokensLeft),
> + ( if NextToken = close_list then
> + pop_nest_open(close_list, Context, MaybeErrorMsg, !PS),
> + (
> + MaybeErrorMsg = no,
> + Term = term.functor(term.atom("[|]"), [Arg, Tail],
> + TermContext),
> + List = pr_ok(Term)
> + ;
> + MaybeErrorMsg = yes(ErrorMsg),
> + List = pr_error(pr_error_ctxt(Context, ErrorMsg))
> + )
> + else
> + report_unexpected_token(NextToken, NextContext,
> + expected("']' or operator"), List,
> + !TokensLeft, !.PS)
Open with a backtick: `]'
> + )
> + ;
> + !.TokensLeft = token_nil,
> + report_unexpected_eof(expected("']' or operator"),
> + List, !.PS)
> )
Open with a backtick: `]'
> else
> - parser_unexpected_tok(Token, Context,
> - "expected comma, `|', `]', or operator",
> + report_unexpected_token(Token, Context,
> + expected("comma, `|', `]', or operator"),
> List, !TokensLeft, !.PS)
> )
> ;
> !.TokensLeft = token_nil,
> % XXX The error message should state the line that the list started on.
> - List = pr_error("unexpected end-of-file in list", !.TokensLeft)
> + % ZZZ
> + report_unexpected_eof(expected("',', '|' or ']'"), List, !.PS)
> ).
ZZZ here.
Use backticks.
> diff --git a/tests/invalid_nodepend/unbalanced.err_exp b/tests/invalid_nodepend/unbalanced.err_exp
> index e69de29bb..f04ac815a 100644
> --- a/tests/invalid_nodepend/unbalanced.err_exp
> +++ b/tests/invalid_nodepend/unbalanced.err_exp
> @@ -0,0 +1,20 @@
> +unbalanced.m:013: Error: predicate `p'/2 has no clauses.
> +unbalanced.m:029: Syntax error: expected `)' or operator, got token `. ', with
> +unbalanced.m:029: , and preceded by open '(' on line 21.
This message is faulty.
> +unbalanced.m:031: Error: predicate `q'/2 has no clauses.
> +unbalanced.m:038: Syntax error: expected comma, `|', `]', or operator, got
> +unbalanced.m:038: token `)' (there is an open '[' on line 37 between the open
> +unbalanced.m:038: '(' on line 36 and the ')' here).
I find all the punctuation, symbols, and line numbers hard to follow
and distracting.
Here is my suggestion:
Syntax error at token `)': expected comma, `|', `]', or operator.
There is an unclosed `[' on line 37.
I think it is enough to point out the location and identity of
the last unclosed bracket.
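To spell out the format I am suggesting, here is a sketch in Python
(invented names, not proposed Mercury code) of a message builder that
states the offending token first and then reports only the innermost
unclosed bracket from the nest stack:

```python
def suggest_message(offending, expected, open_stack):
    """offending and expected are preformatted strings; open_stack is a
    list of (bracket, line) pairs, innermost last."""
    lines = [f"Syntax error at token {offending}: expected {expected}."]
    if open_stack:
        # Report only the location and identity of the last unclosed
        # bracket, not the whole stack.
        bracket, line = open_stack[-1]
        lines.append(f"There is an unclosed `{bracket}' on line {line}.")
    return "\n".join(lines)
```

With the stack from the example above ('(' on line 36, '[' on line 37),
this produces the two-line message I quoted, mentioning only the '[' on
line 37.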
> +unbalanced.m:048: Error: predicate `r'/2 has no clauses.
> +unbalanced.m:051: Syntax error: expected comma, `|', `]', or operator, got
> +unbalanced.m:051: token `}' (there is an open '[' on line 51 between the open
> +unbalanced.m:051: '{' on line 51 and the '}' here).
> +unbalanced.m:053: Error: predicate `s'/2 has no clauses.
> +unbalanced.m:056: Syntax error: expected start of (sub)term, got token `]'
> +unbalanced.m:056: (there is no open '[' to close here).
My suggestion:
Syntax error at token `]': expected start of (sub)term.
There is no open `[' to close here.
> +unbalanced.m:058: Error: predicate `t'/2 has no clauses.
> +unbalanced.m:061: Syntax error: expected comma, `|', `]', or operator, got
> +unbalanced.m:061: token `. ', with open '[' on line 61, and preceded by open
> +unbalanced.m:061: '(' on line 61.
> +unbalanced.m:064: Syntax error: expected ',', '|' or ']', got end-of-file, with
> +unbalanced.m:064: open '[' on line 64, and preceded by open '(' on line 63.
My suggestion:
Syntax error at end-of-file: expected `,', `|' or `]'.
There is an unclosed `[' on line 64.
Peter