[m-rev.] for review: Make string.split_at_separator skip ill-formed sequences in UTF-8 strings.

Mark Brown mark at mercurylang.org
Wed Oct 30 19:25:42 AEDT 2019


This looks fine.

On Wed, Oct 30, 2019 at 5:10 PM Peter Wang <novalazy at gmail.com> wrote:
>
> library/string.m:
>     Make split_at_separator never consider ill-formed sequences in UTF-8
>     strings as potential separators, as they cannot contain any code
>     points that could satisfy any given DelimP predicate on code points.
>     Previously, split_at_separator would call DelimP(U+FFFD) for every
>     code unit in an ill-formed sequence.
> ---
>  library/string.m | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/library/string.m b/library/string.m
> index 75dd54abd..37da065af 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -4249,12 +4249,11 @@ split_at_separator_loop(DelimP, Str, CurPos, PastSegEnd, !Segments) :-
>      % Invariant: 0 =< CurPos =< length(Str).
>      % PastSegEnd is one past the last index of the current segment.
>      %
> -    % XXX ILSEQ unsafe_prev_index fails at an ill-formed sequence.
> -    % Ideally code units in an ill-form sequence are skipped over
> -    % since they cannot be delimiters.
> -    %
> -    ( if unsafe_prev_index(Str, CurPos, PrevPos, Char) then
> -        ( if DelimP(Char) then
> +    ( if unsafe_prev_index_repl(Str, CurPos, PrevPos, Char, IsReplaced) then
> +        ( if
> +            IsReplaced = no,
> +            DelimP(Char)
> +        then
>              % Chop here.
>              SegStart = CurPos,
>              Segment = unsafe_between(Str, SegStart, PastSegEnd),
> --
> 2.23.0
>
> _______________________________________________
> reviews mailing list
> reviews at lists.mercurylang.org
> https://lists.mercurylang.org/listinfo/reviews
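
For readers without the Mercury sources to hand, the behaviour described in the commit message can be modelled roughly as follows. This is only a Python sketch over raw bytes, not the library code: `prev_index_repl` is a hypothetical stand-in for the role `unsafe_prev_index_repl` plays in the patch, and its return shape (previous position, code point, replaced flag) is an assumption made for illustration.

```python
REPLACEMENT = "\ufffd"


def prev_index_repl(data: bytes, pos: int):
    """Step back over one code point ending at pos (requires pos > 0).

    Hypothetical model, not the Mercury API: returns a tuple
    (prev_pos, char, is_replaced). If no well-formed UTF-8 sequence
    ends at pos, step back a single code unit and report U+FFFD with
    is_replaced=True.
    """
    for n in range(1, 5):  # a UTF-8 sequence is 1 to 4 code units long
        start = pos - n
        if start < 0:
            break
        try:
            text = data[start:pos].decode("utf-8")
        except UnicodeDecodeError:
            continue
        if len(text) == 1:
            return start, text, False
    return pos - 1, REPLACEMENT, True


def split_at_separator(delim_p, data: bytes):
    """Split data at code points accepted by delim_p, scanning backwards.

    Code units from ill-formed sequences are never passed to delim_p,
    so they can never act as separators.
    """
    segments = []
    past_seg_end = len(data)
    cur = len(data)
    while cur > 0:
        prev_pos, char, is_replaced = prev_index_repl(data, cur)
        if not is_replaced and delim_p(char):
            # Chop here: the segment runs from cur up to past_seg_end.
            segments.append(data[cur:past_seg_end])
            past_seg_end = prev_pos
        cur = prev_pos
    segments.append(data[0:past_seg_end])
    segments.reverse()
    return segments


def comma_or_fffd(char: str) -> bool:
    return char == "," or char == REPLACEMENT


# The stray 0xFF byte is ill-formed, so it is skipped rather than
# offered to the predicate as U+FFFD.
print(split_at_separator(comma_or_fffd, b"a,b\xffc"))
# A genuine, well-formed U+FFFD in the string is still a valid separator.
print(split_at_separator(comma_or_fffd, "x\ufffdy".encode("utf-8")))
```

The predicate here deliberately accepts U+FFFD so the difference is visible: under the previous behaviour described in the commit message, the 0xFF code unit would have been offered to the predicate as U+FFFD and the string would have been split there.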

