[m-rev.] for post-commit review: improve string.m
Peter Wang
novalazy at gmail.com
Mon Mar 25 12:35:20 AEDT 2024
On Sun, 24 Mar 2024 19:18:10 +1100 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> Improve several aspects of string.m.
>
> diff --git a/library/string.m b/library/string.m
> index e1b78cc2a..41f56bb09 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -2390,11 +2398,13 @@ from_code_unit_list_allow_ill_formed(CodeList, Str) :-
>
>  %---------------------%
>
> -from_utf8_code_unit_list(CodeList, String) :-
> +from_utf8_code_unit_list(CodeUnits, String) :-
>      ( if internal_encoding_is_utf8 then
> -        from_code_unit_list(CodeList, String)
> +        from_code_unit_list(CodeUnits, String)
>      else
> -        decode_utf8(CodeList, [], RevChars),
> +        decode_utf8(CodeUnits, [], RevChars),
> +        % XXX This checks whether RevChars represents a well-formed string.
> +        % Why? The call to decode_utf8 should have ensured that already.
>          semidet_from_rev_char_list(RevChars, String)
>      ).

decode_utf8 doesn't reject code points in the Unicode surrogate range
(U+D800-U+DFFF, which strictly speaking are not valid to encode in UTF-8),
but semidet_from_rev_char_list DOES reject them.
The change that made semidet_from_rev_char_list reject them was made later.

If decode_utf8 (acc_rev_chars_from_utf8_code_units) detected surrogates
itself, then there could be a variant of semidet_from_rev_char_list that
assumes validity. I think it would not be worthwhile.
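
To make the surrogate point concrete, here is a minimal sketch in Python.
It is not the Mercury library code, and the names decode_utf8_lenient,
is_surrogate and decode_utf8_strict are hypothetical. It assumes a decoder
that checks only the structure of the byte sequence (lead and continuation
bytes, overlong forms, the U+10FFFF limit), followed by a separate pass that
rejects surrogates, which is roughly the division of labour described above.

```python
def decode_utf8_lenient(code_units):
    """Decode UTF-8 code units structurally, with no surrogate check.

    Returns a list of integer code points, or None if the input is
    structurally malformed.
    """
    code_points = []
    i = 0
    n = len(code_units)
    while i < n:
        lead = code_units[i]
        # Pick the sequence length, the payload bits of the lead byte,
        # and the smallest code point that needs that length.
        if lead < 0x80:
            length, value, minimum = 1, lead, 0x0
        elif 0xC2 <= lead <= 0xDF:
            length, value, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, value, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, value, minimum = 4, lead & 0x07, 0x10000
        else:
            return None
        if i + length > n:
            return None
        for unit in code_units[i + 1:i + length]:
            if unit & 0xC0 != 0x80:
                return None  # not a continuation byte
            value = (value << 6) | (unit & 0x3F)
        if value < minimum or value > 0x10FFFF:
            return None  # overlong form or out of range
        code_points.append(value)
        i += length
    return code_points


def is_surrogate(code_point):
    """True for code points in the surrogate range U+D800-U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def decode_utf8_strict(code_units):
    """Lenient decode followed by the separate surrogate check."""
    code_points = decode_utf8_lenient(code_units)
    if code_points is None or any(is_surrogate(cp) for cp in code_points):
        return None
    return code_points
```

With these definitions, the code units ED A0 80 pass the lenient decoder as
the single code point U+D800, and only the second pass rejects them. That is
why the second check is not redundant as long as the decoder itself does not
test for surrogates.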
Peter