[m-rev.] for review: Document that string.count_utf8_code_units throws exceptions.
Mark Brown
mark at mercurylang.org
Tue Nov 5 02:00:48 AEDT 2019
This looks fine.
On Mon, Nov 4, 2019 at 4:52 PM Peter Wang <novalazy at gmail.com> wrote:
>
> library/string.m:
> Document that count_utf8_code_units throws an exception if the
> string contains an unpaired surrogate code point.
>
> Make the exception message thrown more useful to callers.
>
> Delete unnecessary foreign_procs.
> ---
> library/string.m | 40 +++++++++++++++++-----------------------
> 1 file changed, 17 insertions(+), 23 deletions(-)
>
> diff --git a/library/string.m b/library/string.m
> index efdce899c..275a49728 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -455,8 +455,15 @@
> :- func count_codepoints(string) = int.
> :- pred count_codepoints(string::in, int::out) is det.
>
> - % Determine the number of code units required to represent a string in
> - % UTF-8 encoding.
> + % count_utf8_code_units(String) = Length:
> + %
> + % Return the number of code units required to represent a string in
> + % UTF-8 encoding (with allowance for ill-formed sequences).
> + % Equivalent to `Length = length(to_utf8_code_unit_list(String))'.
> + %
> + % Throws an exception if strings use UTF-16 encoding but the given string
> + % contains an unpaired surrogate code point. Surrogate code points cannot
> + % be represented in UTF-8.
> %
> :- func count_utf8_code_units(string) = int.
>
> @@ -3010,36 +3017,23 @@ count_codepoints_loop(String, I, Count0, Count) :-
>
> %---------------------%
>
> -% XXX ILSEQ Behaviour depends on target language.
> -% In UTF-16 grades, count_utf8_code_units uses string.foldl which (currently)
> -% stops at the first ill-formed sequence.
> -
> -:- pragma foreign_proc("C",
> - count_utf8_code_units(Str::in) = (Length::out),
> - [will_not_call_mercury, promise_pure, thread_safe],
> -"
> - Length = strlen(Str);
> -").
> -:- pragma foreign_proc("Erlang",
> - count_utf8_code_units(Str::in) = (Length::out),
> - [will_not_call_mercury, promise_pure, thread_safe],
> -"
> - Length = size(Str)
> -").
> -
> count_utf8_code_units(String) = Length :-
> - foldl(count_utf8_code_units_2, String, 0, Length).
> + ( if internal_encoding_is_utf8 then
> + Length = length(String)
> + else
> + foldl(count_utf16_to_utf8_code_units, String, 0, Length)
> + ).
>
> -:- pred count_utf8_code_units_2(char::in, int::in, int::out) is det.
> +:- pred count_utf16_to_utf8_code_units(char::in, int::in, int::out) is det.
>
> -count_utf8_code_units_2(Char, !Length) :-
> +count_utf16_to_utf8_code_units(Char, !Length) :-
> char.to_int(Char, CharInt),
> ( if CharInt =< 0x7f then
> !:Length = !.Length + 1
> else if char.to_utf8(Char, UTF8) then
> !:Length = !.Length + list.length(UTF8)
> else
> - error($pred, "char.to_utf8 failed")
> + error($pred, "surrogate code point")
> ).
>
> %---------------------%
> --
> 2.23.0
>
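For readers of the new comment, here is a minimal sketch of how a caller might exercise the documented behaviour. The module name `utf8_count_sketch` and the sample string are made up for illustration; only `string.count_utf8_code_units` and `io.format` come from the standard library. It assumes a grade where the call does not throw, i.e. the string holds no unpaired surrogate.

```mercury
% Hypothetical demo module, not part of the patch.
:- module utf8_count_sketch.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module list.
:- import_module string.

main(!IO) :-
    % Three code points: U+0061 (1 UTF-8 code unit), U+00E9 (2 code units),
    % U+20AC (3 code units), so 6 UTF-8 code units are expected in total.
    Str = "a\u00E9\u20AC",
    Length = string.count_utf8_code_units(Str),
    io.format("%d\n", [i(Length)], !IO).
```

In a UTF-16 grade, a string containing an unpaired surrogate would make the call throw instead, as the new comment describes, so a caller that cannot rule that out would need to catch the exception.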