[m-rev.] for review: Make string.split_at_separator skip ill-formed sequences in UTF-8 strings.

Peter Wang novalazy at gmail.com
Wed Oct 30 17:09:39 AEDT 2019


library/string.m:
    Make split_at_separator never consider ill-formed sequences in UTF-8
    strings as potential separators, as they cannot contain any code
    points that could satisfy any given DelimP predicate on code points.
    Previously, split_at_separator would call DelimP(U+FFFD) for every
    code unit in an ill-formed sequence.
---
 library/string.m | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
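
For reviewers who want to see the intended behaviour in isolation, here is
a rough Python model of the backward scan. It is only a sketch and not the
Mercury implementation: prev_code_point and this split_at_separator are
hypothetical stand-ins for unsafe_prev_index_repl and
split_at_separator_loop, and the string is modelled as raw UTF-8 bytes.
The assumption is that an ill-formed code unit is stepped over one unit at
a time and reported as U+FFFD with a "replaced" flag, so that the delimiter
predicate is never consulted for it.

```python
from typing import Callable, List, Optional, Tuple

REPLACEMENT = "\ufffd"


def prev_code_point(buf: bytes, pos: int) -> Optional[Tuple[int, str, bool]]:
    """Step back from pos; return (prev_pos, char, is_replaced), or None at the start."""
    if pos <= 0:
        return None
    # A well-formed UTF-8 sequence ending at pos is 1 to 4 code units long.
    for width in range(1, 5):
        start = pos - width
        if start < 0:
            break
        try:
            text = buf[start:pos].decode("utf-8")
        except UnicodeDecodeError:
            continue
        if len(text) == 1:
            return start, text, False
    # Ill-formed: step over a single code unit and flag it as replaced.
    return pos - 1, REPLACEMENT, True


def split_at_separator(is_delim: Callable[[str], bool], buf: bytes) -> List[bytes]:
    """Split buf at code points accepted by is_delim, scanning from the end."""
    segments: List[bytes] = []
    cur = len(buf)
    past_seg_end = len(buf)
    while True:
        step = prev_code_point(buf, cur)
        if step is None:
            # Reached the start: the remaining prefix is the first segment.
            segments.append(buf[0:past_seg_end])
            break
        prev, char, is_replaced = step
        # Only a genuinely decoded code point may act as a delimiter.
        if not is_replaced and is_delim(char):
            segments.append(buf[cur:past_seg_end])
            past_seg_end = prev
        cur = prev
    segments.reverse()
    return segments


def is_comma(c: str) -> bool:
    return c == ","


def is_replacement(c: str) -> bool:
    return c == REPLACEMENT


with_bad_byte = split_at_separator(is_comma, b"a,\xffb,c")
bad_byte_not_delim = split_at_separator(is_replacement, b"a\xffb")
real_fffd_is_delim = split_at_separator(is_replacement, b"a\xef\xbf\xbdb")
```

In this model the stray byte 0xFF stays inside its segment, and a
predicate that accepts U+FFFD splits only at a U+FFFD that is actually
encoded in the string, not at an ill-formed code unit.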

diff --git a/library/string.m b/library/string.m
index 75dd54abd..37da065af 100644
--- a/library/string.m
+++ b/library/string.m
@@ -4249,12 +4249,11 @@ split_at_separator_loop(DelimP, Str, CurPos, PastSegEnd, !Segments) :-
     % Invariant: 0 =< CurPos =< length(Str).
     % PastSegEnd is one past the last index of the current segment.
     %
-    % XXX ILSEQ unsafe_prev_index fails at an ill-formed sequence.
-    % Ideally code units in an ill-form sequence are skipped over
-    % since they cannot be delimiters.
-    %
-    ( if unsafe_prev_index(Str, CurPos, PrevPos, Char) then
-        ( if DelimP(Char) then
+    ( if unsafe_prev_index_repl(Str, CurPos, PrevPos, Char, IsReplaced) then
+        ( if
+            IsReplaced = no,
+            DelimP(Char)
+        then
             % Chop here.
             SegStart = CurPos,
             Segment = unsafe_between(Str, SegStart, PastSegEnd),
-- 
2.23.0


