[m-rev.] for review: improve unicode diagnostics
Peter Wang
novalazy at gmail.com
Tue Jun 4 15:17:54 AEST 2024
On Tue, 04 Jun 2024 03:12:47 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> For review by Peter.
>
> Zoltan.
> Improve diagnostics for Unicode errors.
>
> library/mercury_term_lexer.m:
> Generate more detailed diagnostics for errors in Unicode escapes.
>
> When we expect 4 or 8 hex digits to specify a Unicode code point
> but get to an end-of-file before we get that many, we used to return
> an eof token *without* any indication of the partial Unicode character.
> Replace this with an error message that is specific to this situation.
>
> tests/invalid_nodepend/unicode_1.err_exp:
> tests/invalid_nodepend/unicode_2.err_exp:
> Expect updated diagnostics.
> +:- pred construct_unicode_char_if_valid(int::in,
> + unicode_char_result::out) is det.
> +
> +construct_unicode_char_if_valid(Value, Result) :-
> + ( if char.from_int(Value, UnicodeChar) then
> + ( if char.is_surrogate(UnicodeChar) then
> + DecodeError = unicode_char_is_surrogate(Value),
> + Result = unicode_decode_error_to_result(DecodeError)
> + else if Value = 0 then
char.from_int(0, _) will throw an exception(!) so you need to test
for Value = 0 first.
> + DecodeError = unicode_char_null,
> + Result = unicode_decode_error_to_result(DecodeError)
> + else
> + Result = unicode_char(UnicodeChar)
> + )
> + else
> + DecodeError = unicode_char_not_in_range(Value),
> + Result = unicode_decode_error_to_result(DecodeError)
> ).
>
> +:- func unicode_decode_error_to_result(unicode_decode_error)
> + = unicode_char_result.
> +
> +unicode_decode_error_to_result(DecodeError) = Result :-
> + (
> + DecodeError = unicode_non_hex_digit(Char),
> + string.format(
> + "non-hex digit char `%c' in Unicode code point specification",
> + [c(Char)], Msg),
> + Token = error(Msg)
I suggest using the same terminology as the manual, which calls them
"Unicode escapes". "Unicode escape sequence" would also be okay.
I suggest "hexadecimal digit" instead of "non-hex digit char".
> + ;
> + DecodeError = unicode_hex_seq_incomplete(Expected, Got),
> + string.format(
> + "expected %d hex digits to specify a %s, got only `%d'",
> + [i(Expected), s("Unicode code point"), i(Got)], Msg),
> + Token = error(Msg)
Same here.
> + ;
> + DecodeError = unicode_char_null,
> + Token = null_character_error
> + ;
> + DecodeError = unicode_char_not_in_range(N),
> + string.format(
> + "the Unicode code point `U+%X' is outside the valid range",
> + [i(N)], Msg),
> + Token = error(Msg)
I would change the message to:
U+nnnn is not a valid Unicode code point
so as not to imply that there exist Unicode code points outside of the
valid range.
I would remove the `quotes'.
> + ;
> + DecodeError = unicode_char_is_surrogate(N),
> + string.format(
> + "the Unicode code point `U+%X' is reserved for surrogates",
> + [i(N)], Msg),
> + Token = error(Msg)
> + ),
> + Result = unicode_nonchar(Token).
I suggest:
U+nnnn is a Unicode surrogate code point
Otherwise, that's fine.
Peter
More information about the reviews
mailing list