[m-rev.] for review: Make string.split_at_separator skip ill-formed sequences in UTF-8 strings.
Mark Brown
mark at mercurylang.org
Wed Oct 30 19:25:42 AEDT 2019
This looks fine.
On Wed, Oct 30, 2019 at 5:10 PM Peter Wang <novalazy at gmail.com> wrote:
>
> library/string.m:
> Make split_at_separator never consider ill-formed sequences in UTF-8
> strings as potential separators, as they cannot contain any code
> points that could satisfy any given DelimP predicate on code points.
> Previously, split_at_separator would call DelimP(U+FFFD) for every
> code unit in an ill-formed sequence.
> ---
> library/string.m | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/library/string.m b/library/string.m
> index 75dd54abd..37da065af 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -4249,12 +4249,11 @@ split_at_separator_loop(DelimP, Str, CurPos, PastSegEnd, !Segments) :-
>      % Invariant: 0 =< CurPos =< length(Str).
>      % PastSegEnd is one past the last index of the current segment.
>      %
> -    % XXX ILSEQ unsafe_prev_index fails at an ill-formed sequence.
> -    % Ideally code units in an ill-form sequence are skipped over
> -    % since they cannot be delimiters.
> -    %
> -    ( if unsafe_prev_index(Str, CurPos, PrevPos, Char) then
> -        ( if DelimP(Char) then
> +    ( if unsafe_prev_index_repl(Str, CurPos, PrevPos, Char, IsReplaced) then
> +        ( if
> +            IsReplaced = no,
> +            DelimP(Char)
> +        then
>              % Chop here.
>              SegStart = CurPos,
>              Segment = unsafe_between(Str, SegStart, PastSegEnd),
> --
> 2.23.0
>
> _______________________________________________
> reviews mailing list
> reviews at lists.mercurylang.org
> https://lists.mercurylang.org/listinfo/reviews
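
For readers who want to see the behaviour change in isolation, here is a
rough model of the patched loop. It is written in Python rather than Mercury
only because ill-formed UTF-8 is easy to construct there as a bytes value.
It is a sketch under assumptions, not the code in library/string.m: the
helper prev_index_repl is a hypothetical stand-in for
unsafe_prev_index_repl, and treating each undecodable byte as a single
replaced code unit is an assumption taken from the log message above.

```python
REPLACEMENT = "\ufffd"


def prev_index_repl(data: bytes, pos: int):
    """Step back over one code point ending just before pos.

    Returns (prev_pos, char, is_replaced), or None at the start of data.
    Hypothetical stand-in for unsafe_prev_index_repl.
    """
    if pos <= 0:
        return None
    # A well-formed UTF-8 sequence is 1 to 4 code units long.
    for n in range(1, 5):
        if pos - n < 0:
            break
        try:
            text = data[pos - n:pos].decode("utf-8")
        except UnicodeDecodeError:
            continue
        if len(text) == 1:
            return pos - n, text, False
    # Ill-formed: step back one code unit and report U+FFFD as a replacement.
    return pos - 1, REPLACEMENT, True


def split_at_separator(is_delim, data: bytes) -> list:
    """Scan backwards, chopping only at well-formed delimiter code points."""
    segments = []
    past_seg_end = len(data)
    cur = len(data)
    while True:
        step = prev_index_repl(data, cur)
        if step is None:
            segments.insert(0, data[0:past_seg_end])
            return segments
        prev, char, is_replaced = step
        # The patched test: a replaced code unit is never offered to is_delim.
        if not is_replaced and is_delim(char):
            segments.insert(0, data[cur:past_seg_end])
            past_seg_end = prev
        cur = prev


def comma_or_fffd(char: str) -> bool:
    return char == "," or char == REPLACEMENT
```

With a delimiter predicate that accepts U+FFFD, the stray byte 0xFF in
b"a\xffb,c" no longer causes a split, while a genuinely encoded U+FFFD
(0xEF 0xBF 0xBD) still does.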