[m-rev.] for review: Fix some handling of ill-formed sequences in string module.

Peter Wang novalazy at gmail.com
Wed Jul 27 14:28:42 AEST 2022


On Wed, 27 Jul 2022 12:32:40 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> 
> 2022-07-27 11:34 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> >> >    Fix the Mercury implementation of string.contains_char to continue
> >> >    searching for the character past any ill-formed code unit sequences.
> >> 
> >> Do we even need the Mercury implementation since there are foreign_proc
> >> implementations for all three target languages?
> >> 
> > 
> > Not really, but we have them for other predicates as well.
> 
> Were those left over from when they were the definition in Erlang grades?
> 

I thought we had pure Mercury implementations as a fallback where
possible, which was useful for bringing up backends or maybe as a
reference. We can delete them if they serve no purpose.

> >  ### Changes to the `string` module
> > 
> > +* We have fixed the behaviour of the following predicates when called on a
> > +  string containing ill-formed code unit sequences:
> > +
> > +   - pred `all_match/2`
> > +   - pred `index_next_repl/5`
> > +   - pred `unsafe_index_next_repl/5`
> > +   - pred `prev_index_repl/5`
> > +   - pred `unsafe_prev_index_repl/5`
> 
> I would s/fixed/clarified/.
> 

I'll change it to "fixed and clarified".

> > --- a/library/string.m
> > +++ b/library/string.m
> > @@ -576,14 +576,15 @@
> >      % all_match(TestPred, String):
> >      %
> >      % True iff String is empty or contains only code points that satisfy
> > -    % TestPred.
> > +    % TestPred. False if String contains an ill-formed code unit sequence.
> >      %
> 
> That doesn't work, because the second sentence adds a condition for False
> that the "iff" in the first sentence rules out. How about
> 
> True iff all code points in String satisfy Pred, and String contains no ill-formed
> code unit sequences.

Ok. The second sentence wasn't intended to add an extra condition,
but I can see how it might be interpreted that way.

> 
> The other two comment changes are fine.
> 
> The NEWS update mentions two other predicates: why no update
> to them?

Do you mean (unsafe_)index_next_repl and (unsafe_)prev_index_repl?
The change I made was to fix an edge case. However, their documentation
wasn't terribly clear so I've revised them below (for review).

Peter

diff --git a/NEWS b/NEWS
index 67148bf68..0bbd29c1f 100644
--- a/NEWS
+++ b/NEWS
@@ -355,6 +355,15 @@ Changes to the Mercury standard library

 ### Changes to the `string` module

+* We have fixed and clarified the behaviour of the following predicates when
+  called on a string containing ill-formed code unit sequences:
+
+   - pred `all_match/2`
+   - pred `index_next_repl/5`
+   - pred `unsafe_index_next_repl/5`
+   - pred `prev_index_repl/5`
+   - pred `unsafe_prev_index_repl/5`
+
 * The following predicate has been added:

    - pred `contains_match/2`
diff --git a/library/string.m b/library/string.m
index 573ea2909..4307f55bd 100644
--- a/library/string.m
+++ b/library/string.m
@@ -301,12 +301,15 @@

     % index_next_repl(String, Index, NextIndex, Char, MaybeReplaced):
     %
-    % Like index_next/4 but MaybeReplaced is `replaced_code_unit(CodeUnit)'
-    % iff Char is the Unicode REPLACEMENT CHARACTER (U+FFFD) but String
-    % does NOT contain an encoding of U+FFFD beginning at Index;
-    % CodeUnit is the code unit at Index.
-    % (MaybeReplaced is always `not_replaced' when strings are UTF-16
-    % encoded.)
+    % Like index_next/4 but also returns MaybeReplaced on success.
+    % When Char is not U+FFFD then MaybeReplaced is always `not_replaced'.
+    % When Char is U+FFFD then there are two cases:
+    %
+    % - If there is a U+FFFD code point encoded in String at
+    %   [Index, NextIndex) then MaybeReplaced is `not_replaced'.
+    %
+    % - Otherwise, MaybeReplaced is `replaced_code_unit(CodeUnit)' where
+    %   CodeUnit is the code unit in String at Index.
     %
 :- pred index_next_repl(string::in, int::in, int::out, char::uo,
     maybe_replaced::out) is semidet.
@@ -350,12 +353,15 @@

     % prev_index_repl(String, Index, PrevIndex, Char, MaybeReplaced):
     %
-    % Like prev_index/4 but MaybeReplaced is `replaced_code_unit(CodeUnit)'
-    % iff Char is the Unicode REPLACEMENT CHARACTER (U+FFFD) but String
-    % does NOT contain an encoding of U+FFFD ending at Index - 1;
-    % CodeUnit is the code unit at Index - 1.
-    % (MaybeReplaced is always `not_replaced' when strings are UTF-16
-    % encoded.)
+    % Like prev_index/4 but also returns MaybeReplaced on success.
+    % When Char is not U+FFFD then MaybeReplaced is always `not_replaced'.
+    % When Char is U+FFFD then there are two cases:
+    %
+    % - If there is a U+FFFD code point encoded in String at
+    %   [PrevIndex, Index) then MaybeReplaced is `not_replaced'.
+    %
+    % - Otherwise, MaybeReplaced is `replaced_code_unit(CodeUnit)' where
+    %   CodeUnit is the code unit in String at Index - 1.
     %
 :- pred prev_index_repl(string::in, int::in, int::out, char::uo,
     maybe_replaced::out) is semidet.
@@ -575,15 +581,16 @@

     % all_match(TestPred, String):
     %
-    % True iff String is empty or contains only code points that satisfy
-    % TestPred.
+    % True iff all code points in String satisfy TestPred, and String contains
+    % no ill-formed code unit sequences.
     %
 :- pred all_match(pred(char)::in(pred(in) is semidet), string::in) is semidet.

     % contains_match(TestPred, String):
     %
     % True iff String contains at least one code point that satisfies
-    % TestPred.
+    % TestPred. Any ill-formed code unit sequences in String are ignored
+    % as they do not encode code points.
     %
 :- pred contains_match(pred(char)::in(pred(in) is semidet), string::in)
     is semidet.
@@ -591,6 +598,8 @@
     % contains_char(String, Char):
     %
     % Succeed if the code point Char occurs in String.
+    % Any ill-formed code unit sequences within String are ignored
+    % as they will not contain Char.
     %
 :- pred contains_char(string::in, char::in) is semidet.




More information about the reviews mailing list