[m-rev.] for review: Define behaviour of string.count_codepoints with ill-formed sequences.

Mark Brown mark at mercurylang.org
Thu Oct 17 03:56:07 AEDT 2019


This looks fine.

On Wed, Oct 16, 2019 at 5:05 PM Peter Wang <novalazy at gmail.com> wrote:
>
> library/string.m:
>     Make each code unit in an ill-formed sequence contribute one
>     to the value of string.count_codepoints.
>
> tests/hard_coded/Mmakefile:
> tests/hard_coded/string_count_codepoints_ilseq.exp:
> tests/hard_coded/string_count_codepoints_ilseq.exp2:
> tests/hard_coded/string_count_codepoints_ilseq.m:
>     Add test case.
> ---
>  library/string.m                              | 44 +++----------------
>  tests/hard_coded/Mmakefile                    |  1 +
>  .../string_count_codepoints_ilseq.exp         |  1 +
>  .../string_count_codepoints_ilseq.exp2        |  1 +
>  .../string_count_codepoints_ilseq.m           | 35 +++++++++++++++
>  5 files changed, 43 insertions(+), 39 deletions(-)
>  create mode 100644 tests/hard_coded/string_count_codepoints_ilseq.exp
>  create mode 100644 tests/hard_coded/string_count_codepoints_ilseq.exp2
>  create mode 100644 tests/hard_coded/string_count_codepoints_ilseq.m
>
> diff --git a/library/string.m b/library/string.m
> index 0501a3198..884363c82 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -421,6 +421,11 @@
>
>      % Determine the number of code points in a string.
>      %
> +    % Each valid code point, and each code unit that is part of an ill-formed
> +    % sequence, contributes one to the result.
> +    % (This matches the number of steps it would take to iterate over the
> +    % string using string.index_next or string.prev_index.)
> +    %
>      % NOTE The names of this predicate and several other predicates
>      % may be changed in the future to refer to code_points, not codepoints,
>      % for consistency with predicate names that talk about code_units.
> @@ -2737,48 +2742,9 @@ count_code_units(Str, Length) :-
>
>  %---------------------%
>
> -% XXX ILSEQ Behaviour depends on target language.
> -%   - c: counts lead bytes without checking trailing bytes
> -%   - java: unpaired surrogates count as one code point each
> -%   - csharp: counts non-surrogates and high surrogates
> -%   - generic: unsafe_index_next stops at the first ill-formed sequence
> -%
> -% The Java behaviour seems best: count well-formed code points plus each code
> -% unit in an ill-formed sequence. index_next and other iteration predicates
> -% should be made consistent with however count_codepoints ends up counting.
> -
>  count_codepoints(String) = Count :-
>      count_codepoints(String, Count).
>
> -:- pragma foreign_proc("C",
> -    count_codepoints(String::in, Count::out),
> -    [will_not_call_mercury, promise_pure, thread_safe, may_not_duplicate],
> -"
> -    unsigned char   b;
> -    int             i;
> -
> -    Count = 0;
> -    for (i = 0; ; i++) {
> -        b = String[i];
> -        if (b == '\\0') {
> -            break;
> -        }
> -        if (MR_utf8_is_single_byte(b) || MR_utf8_is_lead_byte(b)) {
> -            Count++;
> -        }
> -    }
> -").
> -:- pragma foreign_proc("C#",
> -    count_codepoints(String::in, Count::out),
> -    [will_not_call_mercury, promise_pure, thread_safe, may_not_duplicate],
> -"
> -    Count = 0;
> -    foreach (char c in String) {
> -        if (!System.Char.IsLowSurrogate(c)) {
> -            Count++;
> -        }
> -    }
> -").
>  :- pragma foreign_proc("Java",
>      count_codepoints(String::in, Count::out),
>      [will_not_call_mercury, promise_pure, thread_safe, may_not_duplicate],
> diff --git a/tests/hard_coded/Mmakefile b/tests/hard_coded/Mmakefile
> index f11b93eb0..7906dcbf3 100644
> --- a/tests/hard_coded/Mmakefile
> +++ b/tests/hard_coded/Mmakefile
> @@ -358,6 +358,7 @@ ORDINARY_PROGS = \
>         string_class \
>         string_code_unit \
>         string_codepoint \
> +       string_count_codepoints_ilseq \
>         string_first_char \
>         string_index_ilseq \
>         string_index_next_ilseq \
> diff --git a/tests/hard_coded/string_count_codepoints_ilseq.exp b/tests/hard_coded/string_count_codepoints_ilseq.exp
> new file mode 100644
> index 000000000..7f8f011eb
> --- /dev/null
> +++ b/tests/hard_coded/string_count_codepoints_ilseq.exp
> @@ -0,0 +1 @@
> +7
> diff --git a/tests/hard_coded/string_count_codepoints_ilseq.exp2 b/tests/hard_coded/string_count_codepoints_ilseq.exp2
> new file mode 100644
> index 000000000..7ed6ff82d
> --- /dev/null
> +++ b/tests/hard_coded/string_count_codepoints_ilseq.exp2
> @@ -0,0 +1 @@
> +5
> diff --git a/tests/hard_coded/string_count_codepoints_ilseq.m b/tests/hard_coded/string_count_codepoints_ilseq.m
> new file mode 100644
> index 000000000..a382a00b0
> --- /dev/null
> +++ b/tests/hard_coded/string_count_codepoints_ilseq.m
> @@ -0,0 +1,35 @@
> +%---------------------------------------------------------------------------%
> +% vim: ts=4 sw=4 et ft=mercury
> +%---------------------------------------------------------------------------%
> +%
> +% The .exp file is for backends using UTF-8 string encoding.
> +% The .exp2 file is for backends using UTF-16 string encoding.
> +%
> +%---------------------------------------------------------------------------%
> +
> +:- module string_count_codepoints_ilseq.
> +:- interface.
> +
> +:- import_module io.
> +
> +:- pred main(io::di, io::uo) is det.
> +
> +%---------------------------------------------------------------------------%
> +
> +:- implementation.
> +
> +:- import_module char.
> +:- import_module int.
> +:- import_module list.
> +:- import_module string.
> +
> +%---------------------------------------------------------------------------%
> +
> +main(!IO) :-
> +    S0 = "",
> +    S1 = string.between(S0, 0, 1),
> +    S2 = string.between(S0, 1, count_code_units(S0)),
> +    S = S0 ++ S1 ++ "," ++ S2 ++ S0,
> +    string.count_codepoints(S, Num),
> +    io.write_int(Num, !IO),
> +    io.nl(!IO).
> --
> 2.23.0
>
> _______________________________________________
> reviews mailing list
> reviews at lists.mercurylang.org
> https://lists.mercurylang.org/listinfo/reviews


More information about the reviews mailing list