[m-rev.] for review: improve unicode diagnostics

Peter Wang novalazy at gmail.com
Tue Jun 4 15:17:54 AEST 2024


On Tue, 04 Jun 2024 03:12:47 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> For review by Peter.
> 
> Zoltan.

> Improve diagnostics for Unicode errors.
> 
> library/mercury_term_lexer.m:
>     Generate more detailed diagnostics for errors in Unicode escapes.
> 
>     When we expect 4 or 8 hex digits to specify a Unicode code point
>     but get to an end-of-file before we get that many, we used to return
>     an eof token *without* any indication of the partial Unicode character.
>     Replace this with an error message that is specific to this situation.
> 
> tests/invalid_nodepend/unicode_1.err_exp:
> tests/invalid_nodepend/unicode_2.err_exp:
>     Expect updated diagnostics.

> +:- pred construct_unicode_char_if_valid(int::in,
> +    unicode_char_result::out) is det.
> +
> +construct_unicode_char_if_valid(Value, Result) :-
> +    ( if char.from_int(Value, UnicodeChar) then
> +        ( if char.is_surrogate(UnicodeChar) then
> +            DecodeError = unicode_char_is_surrogate(Value),
> +            Result = unicode_decode_error_to_result(DecodeError)
> +        else if Value = 0 then

char.from_int(0, _) will throw an exception(!) so you need to test
for Value = 0 first.

> +            DecodeError = unicode_char_null,
> +            Result = unicode_decode_error_to_result(DecodeError)
> +        else
> +            Result = unicode_char(UnicodeChar)
> +        )
> +    else
> +        DecodeError = unicode_char_not_in_range(Value),
> +        Result = unicode_decode_error_to_result(DecodeError)
>      ).
>  

> +:- func unicode_decode_error_to_result(unicode_decode_error)
> +    = unicode_char_result.
> +
> +unicode_decode_error_to_result(DecodeError) = Result :-
> +    (
> +        DecodeError = unicode_non_hex_digit(Char),
> +        string.format(
> +            "non-hex digit char `%c' in Unicode code point specification",
> +            [c(Char)], Msg),
> +        Token = error(Msg)

I suggest using the same terminology as the manual, which calls them
"Unicode escapes". "Unicode escape sequence" would also be okay.

I suggest "hexadecimal digit" instead of "non-hex digit char".

> +    ;
> +        DecodeError = unicode_hex_seq_incomplete(Expected, Got),
> +        string.format(
> +            "expected %d hex digits to specify a %s, got only `%d'",
> +            [i(Expected), s("Unicode code point"), i(Got)], Msg),
> +        Token = error(Msg)

Same here.

> +    ;
> +        DecodeError = unicode_char_null,
> +        Token = null_character_error
> +    ;
> +        DecodeError = unicode_char_not_in_range(N),
> +        string.format(
> +            "the Unicode code point `U+%X' is outside the valid range",
> +            [i(N)], Msg),
> +        Token = error(Msg)

I would change the message to:

    U+nnnn is not a valid Unicode code point

so as not to imply that there exist Unicode code points outside of the
valid range.

I would remove the `quotes'.

> +    ;
> +        DecodeError = unicode_char_is_surrogate(N),
> +        string.format(
> +            "the Unicode code point `U+%X' is reserved for surrogates",
> +            [i(N)], Msg),
> +        Token = error(Msg)
> +    ),
> +    Result = unicode_nonchar(Token).

I suggest:

    U+nnnn is a Unicode surrogate code point

Otherwise, that's fine.

Peter


More information about the reviews mailing list