[m-rev.] for review: Fix some handling of ill-formed sequences in string module.
Peter Wang
novalazy at gmail.com
Wed Jul 27 14:28:42 AEST 2022
On Wed, 27 Jul 2022 12:32:40 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
>
> 2022-07-27 11:34 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> >> > Fix the Mercury implementation of string.contains_char to continue
> >> > searching for the character past any ill-formed code unit sequences.
> >>
> >> Do we even need the Mercury implementation since there are foreign_proc
> >> implementations for all three target languages?
> >>
> >
> > Not really, but we have them for other predicates as well.
>
> Were those left over from when they were the definition in Erlang grades?
>
I thought we had pure Mercury implementations as a fallback where
possible, which was useful for bringing up backends or maybe as a
reference. We can delete them if they serve no purpose.
> > ### Changes to the `string` module
> >
> > +* We have fixed the behaviour of the following predicates when called on a
> > + string containing ill-formed code unit sequences:
> > +
> > + - pred `all_match/2`
> > + - pred `index_next_repl/5`
> > + - pred `unsafe_index_next_repl/5`
> > + - pred `prev_index_repl/5`
> > + - pred `unsafe_prev_index_repl/5`
>
> I would s/fixed/clarified/.
>
I'll change it to "fixed and clarified".
> > --- a/library/string.m
> > +++ b/library/string.m
> > @@ -576,14 +576,15 @@
> > % all_match(TestPred, String):
> > %
> > % True iff String is empty or contains only code points that satisfy
> > - % TestPred.
> > + % TestPred. False if String contains an ill-formed code unit sequence.
> > %
>
> That doesn't work, because the second sentence adds a condition for False
> that the "iff" in the first sentence rules out. How about
>
> True iff all code points in String satisfy Pred, and String contains no ill-formed
> code unit sequences.
Ok. The second sentence wasn't intended to add an extra condition,
but I can see how it might be interpreted that way.
>
> The other two comment changes are fine.
>
> The NEWS update mentions two other predicates: why no update
> to them?
Do you mean (unsafe_)index_next_repl and (unsafe_)prev_index_repl?
The change I made was to fix an edge case. However, their documentation
wasn't terribly clear so I've revised them below (for review).
Peter
diff --git a/NEWS b/NEWS
index 67148bf68..0bbd29c1f 100644
--- a/NEWS
+++ b/NEWS
@@ -355,6 +355,15 @@ Changes to the Mercury standard library
### Changes to the `string` module
+* We have fixed and clarified the behaviour of the following predicates when
+ called on a string containing ill-formed code unit sequences:
+
+ - pred `all_match/2`
+ - pred `index_next_repl/5`
+ - pred `unsafe_index_next_repl/5`
+ - pred `prev_index_repl/5`
+ - pred `unsafe_prev_index_repl/5`
+
* The following predicate has been added:
- pred `contains_match/2`
diff --git a/library/string.m b/library/string.m
index 573ea2909..4307f55bd 100644
--- a/library/string.m
+++ b/library/string.m
@@ -301,12 +301,15 @@
% index_next_repl(String, Index, NextIndex, Char, MaybeReplaced):
%
- % Like index_next/4 but MaybeReplaced is `replaced_code_unit(CodeUnit)'
- % iff Char is the Unicode REPLACEMENT CHARACTER (U+FFFD) but String
- % does NOT contain an encoding of U+FFFD beginning at Index;
- % CodeUnit is the code unit at Index.
- % (MaybeReplaced is always `not_replaced' when strings are UTF-16
- % encoded.)
+ % Like index_next/4 but also returns MaybeReplaced on success.
+ % When Char is not U+FFFD then MaybeReplaced is always `not_replaced'.
+ % When Char is U+FFFD then there are two cases:
+ %
+ % - If there is a U+FFFD code point encoded in String at
+ % [Index, NextIndex) then MaybeReplaced is `not_replaced'.
+ %
+ % - Otherwise, MaybeReplaced is `replaced_code_unit(CodeUnit)' where
+ % CodeUnit is the code unit in String at Index.
%
:- pred index_next_repl(string::in, int::in, int::out, char::uo,
maybe_replaced::out) is semidet.
@@ -350,12 +353,15 @@
% prev_index_repl(String, Index, PrevIndex, Char, MaybeReplaced):
%
- % Like prev_index/4 but MaybeReplaced is `replaced_code_unit(CodeUnit)'
- % iff Char is the Unicode REPLACEMENT CHARACTER (U+FFFD) but String
- % does NOT contain an encoding of U+FFFD ending at Index - 1;
- % CodeUnit is the code unit at Index - 1.
- % (MaybeReplaced is always `not_replaced' when strings are UTF-16
- % encoded.)
+ % Like prev_index/4 but also returns MaybeReplaced on success.
+ % When Char is not U+FFFD then MaybeReplaced is always `not_replaced'.
+ % When Char is U+FFFD then there are two cases:
+ %
+ % - If there is a U+FFFD code point encoded in String at
+ % [PrevIndex, Index) then MaybeReplaced is `not_replaced'.
+ %
+ % - Otherwise, MaybeReplaced is `replaced_code_unit(CodeUnit)' where
+ % CodeUnit is the code unit in String at Index - 1.
%
:- pred prev_index_repl(string::in, int::in, int::out, char::uo,
maybe_replaced::out) is semidet.
@@ -575,15 +581,16 @@
% all_match(TestPred, String):
%
- % True iff String is empty or contains only code points that satisfy
- % TestPred.
+ % True iff all code points in String satisfy TestPred, and String contains
+ % no ill-formed code unit sequences.
%
:- pred all_match(pred(char)::in(pred(in) is semidet), string::in) is semidet.
% contains_match(TestPred, String):
%
% True iff String contains at least one code point that satisfies
- % TestPred.
+ % TestPred. Any ill-formed code unit sequences in String are ignored
+ % as they do not encode code points.
%
:- pred contains_match(pred(char)::in(pred(in) is semidet), string::in)
is semidet.
@@ -591,6 +598,8 @@
% contains_char(String, Char):
%
% Succeed if the code point Char occurs in String.
+ % Any ill-formed code unit sequences within String are ignored
+ % as they will not contain Char.
%
:- pred contains_char(string::in, char::in) is semidet.
More information about the reviews
mailing list