[m-rev.] for review: Make string.split_at_separator skip ill-formed sequences in UTF-8 strings.
Peter Wang
novalazy at gmail.com
Wed Oct 30 17:09:39 AEDT 2019
library/string.m:
Make split_at_separator never treat code units in ill-formed sequences
in UTF-8 strings as potential separators, since they do not encode any
code point that could satisfy a given DelimP predicate on code points.
Previously, split_at_separator would call DelimP(U+FFFD) for every
code unit in an ill-formed sequence.
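
For reviewers who want to see the intended behaviour in isolation, here is
a rough model of the backwards scan in Python, working on raw bytes. It is
only a sketch and not the Mercury implementation: prev_index_repl below is
a hypothetical stand-in for unsafe_prev_index_repl, and it assumes that an
ill-formed sequence is reported one code unit at a time as U+FFFD with an
"is replaced" flag.

```python
REPLACEMENT = "\ufffd"


def prev_index_repl(data: bytes, pos: int):
    """Step back one code point from pos.

    Hypothetical model of a "previous index with replacement" helper.
    Returns None at the start of the string, otherwise a tuple
    (prev_pos, char, is_replaced).
    """
    if pos <= 0:
        return None
    # A well-formed UTF-8 encoding of one code point is 1 to 4 bytes long.
    for width in range(1, 5):
        start = pos - width
        if start < 0:
            break
        try:
            text = data[start:pos].decode("utf-8")
        except UnicodeDecodeError:
            continue
        if len(text) == 1:
            return start, text, False
    # No well-formed sequence ends at pos: step back a single code unit
    # and report U+FFFD, flagged as a replacement.
    return pos - 1, REPLACEMENT, True


def split_at_separator(is_delim, data: bytes):
    """Split data at code points satisfying is_delim, scanning backwards."""
    segments = []
    past_seg_end = len(data)
    cur = len(data)
    while True:
        step = prev_index_repl(data, cur)
        if step is None:
            segments.append(data[:past_seg_end])
            break
        prev, char, is_replaced = step
        # Only a genuinely decoded code point may act as a delimiter.
        if not is_replaced and is_delim(char):
            segments.append(data[cur:past_seg_end])
            past_seg_end = prev
        cur = prev
    segments.reverse()
    return segments


def is_delim(char: str) -> bool:
    # Deliberately accepts U+FFFD so the difference is observable.
    return char in (",", REPLACEMENT)


print(split_at_separator(is_delim, b"a,b\xffc"))
print(split_at_separator(is_delim, "x\ufffdy".encode("utf-8")))
```

In this model the stray byte 0xFF is never offered to the predicate, so the
first call splits only at the comma, while a well-formed encoding of U+FFFD
in the second call is still treated as a delimiter.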
---
library/string.m | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/library/string.m b/library/string.m
index 75dd54abd..37da065af 100644
--- a/library/string.m
+++ b/library/string.m
@@ -4249,12 +4249,11 @@ split_at_separator_loop(DelimP, Str, CurPos, PastSegEnd, !Segments) :-
% Invariant: 0 =< CurPos =< length(Str).
% PastSegEnd is one past the last index of the current segment.
%
- % XXX ILSEQ unsafe_prev_index fails at an ill-formed sequence.
- % Ideally code units in an ill-form sequence are skipped over
- % since they cannot be delimiters.
- %
- ( if unsafe_prev_index(Str, CurPos, PrevPos, Char) then
- ( if DelimP(Char) then
+ ( if unsafe_prev_index_repl(Str, CurPos, PrevPos, Char, IsReplaced) then
+ ( if
+ IsReplaced = no,
+ DelimP(Char)
+ then
% Chop here.
SegStart = CurPos,
Segment = unsafe_between(Str, SegStart, PastSegEnd),
--
2.23.0