[m-rev.] for review: Add string indexing predicates that indicate a code unit was replaced.
Peter Wang
novalazy at gmail.com
Fri Nov 15 18:00:30 AEDT 2019
On Fri, 15 Nov 2019 17:21:59 +1100 (AEDT), Julien Fischer <jfischer at opturion.com> wrote:
>
> Hi Peter,
>
> On Thu, 14 Nov 2019, Peter Wang wrote:
>
> > library/string.m:
> > Add index_next_repl, unsafe_index_next_repl, prev_index_repl,
> > unsafe_prev_index_repl predicates that return an indication if a
> > replacement character was returned because an ill-formed code unit
> > sequence was encountered.
> >
> > Add more pragma inlines for indexing predicates.
> >
> > Remove may_not_duplicate attribute on the Erlang version of
> > unsafe_prev_index_repl, which would conflict with the pragma inline
> > declaration.
>
> Does the compiler not complain about that?
It did, that's what prompted me to remove the attribute on the Erlang
foreign_proc.
> > diff --git a/library/string.m b/library/string.m
> > index a9fbfd693..a28c33792 100644
> > --- a/library/string.m
> > +++ b/library/string.m
> > @@ -262,6 +262,10 @@
> > % Reading characters from strings.
> > %
> >
> > +:- type maybe_replaced
> > + ---> not_replaced
> > + ; replaced_code_unit(uint8).
> > +
>
> Document that type.
>
> Is there a reason you didn't have the predicates below return a single
> value? (i.e. a value that's either a character in the string of the
> invalid code unit)
As in
:- type char_or_replaced
---> char(char)
; replaced_code_unit(uint8).
:- pred index_next_repl(string::in, int::in, int::in,
char_or_replaced::out) is semidet.
I forgot to consider it.
Using two return values has an advantage in that it avoids memory
allocation in the normal case. That advantage should go away once the
necessary argument packing optimisations are turned on.
The argument packing wouldn't be used for Java/C#.
The Java backend needs to allocate an array to return two values,
so maybe there is no difference there.
But the C# backend will not need to allocate an object to hold the char.
(I didn't measure anything; maybe none of this matters.)
> > @@ -324,6 +328,17 @@
> > %
> > :- pred index_next(string::in, int::in, int::out, char::uo) is semidet.
> >
> > + % index_next_repl(String, Index, NextIndex, Char, MaybeReplaced):
> > + %
> > + % Like index_next/4 but `MaybeReplaced' is `replaced_code_unit(CodeUnit)'
> > + % iff `Char' is U+FFFD but `String' does NOT contain an encoding of U+FFFD
>
> I suggest:
>
> iff `Char' s the Unicode REPLACEMENT CHARACTER (U+FFFD) but ...
>
> (It's probably good to get the actual name of that code point in a least
> one spot, so it's absolutely clear what is being referrred to.)
I'll add it.
Peter
More information about the reviews
mailing list