[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Mark Brown mark at mercurylang.org
Mon Sep 9 15:04:54 AEST 2019


Hi,

On Mon, Sep 9, 2019 at 11:26 AM Peter Wang <novalazy at gmail.com> wrote:
>
> Hi Mark,
>
> On Thu, 5 Sep 2019 17:29:07 +1000, Mark Brown <mark at mercurylang.org> wrote:
> > Hi Peter,
> >
> > On Thu, Sep 5, 2019 at 4:34 PM Peter Wang <novalazy at gmail.com> wrote:
> > > [1] Note this goes against a Unicode "preference" to collapse a maximal
> > > invalid sequence of bytes that forms a prefix of a valid sequence into
> > > one REPLACEMENT CHARACTER. If we followed that guideline, each step of an
> > > iteration would have to collect up multiple bogus bytes instead of
> > > returning the bogus bytes one by one.
> >
> > That seems to me to be a good approach, and there would be other ways
> > to access the underlying code units if needed (including new
> > predicates that return the char_or_code_unit type). What are the
> > drawbacks?
>
> Both would work.
>
> I think consistently advancing by one bogus code unit is simpler
> to understand and remember. I don't see much point to the maximal
> prefix behaviour.

That's fine by me.
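
For concreteness, here is a rough sketch of the two iteration strategies
being compared. It is written in Python rather than Mercury, and it is not
the string module's API: every name in it is made up for illustration, and
a bogus code unit is represented as a plain int, loosely standing in for
the char_or_code_unit idea. The byte ranges follow the well-formed UTF-8
table in the Unicode Standard.

```python
from typing import Callable, List, Tuple, Union

Item = Union[str, int]  # str = decoded character, int = bogus code unit


def _expected_length(lead: int) -> int:
    """Length of the sequence a lead byte starts, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _second_byte_range(lead: int) -> Tuple[int, int]:
    """Allowed range for the byte right after the lead byte."""
    if lead == 0xE0:
        return (0xA0, 0xBF)  # excludes overlong encodings
    if lead == 0xED:
        return (0x80, 0x9F)  # excludes surrogates
    if lead == 0xF0:
        return (0x90, 0xBF)  # excludes overlong encodings
    if lead == 0xF4:
        return (0x80, 0x8F)  # excludes values above U+10FFFF
    return (0x80, 0xBF)


def _match(data: bytes, i: int) -> Tuple[int, int]:
    """Return (matched, expected) for the sequence starting at index i."""
    expected = _expected_length(data[i])
    if expected == 0:
        return (0, 0)
    matched = 1
    while matched < expected and i + matched < len(data):
        lo, hi = _second_byte_range(data[i]) if matched == 1 else (0x80, 0xBF)
        if not lo <= data[i + matched] <= hi:
            break
        matched += 1
    return (matched, expected)


def step_one_by_one(data: bytes, i: int) -> Tuple[Item, int]:
    """On an ill-formed sequence, return one bogus code unit and advance by 1."""
    matched, expected = _match(data, i)
    if expected and matched == expected:
        return data[i:i + expected].decode("utf-8"), i + expected
    return data[i], i + 1


def step_maximal_prefix(data: bytes, i: int) -> Tuple[Item, int]:
    """On an ill-formed sequence, collapse the maximal prefix into U+FFFD."""
    matched, expected = _match(data, i)
    if expected and matched == expected:
        return data[i:i + expected].decode("utf-8"), i + expected
    return "\ufffd", i + max(matched, 1)


def decode(data: bytes, step: Callable[[bytes, int], Tuple[Item, int]]) -> List[Item]:
    """Iterate over data with the given step function and collect the items."""
    out: List[Item] = []
    i = 0
    while i < len(data):
        item, i = step(data, i)
        out.append(item)
    return out
```

Given the bytes 61 E2 82 62 (an "a", a truncated three-byte sequence, then
a "b"), decode with step_one_by_one yields "a", 0xE2, 0x82, "b", while
step_maximal_prefix yields "a", U+FFFD, "b". The one-by-one version always
advances by exactly one code unit on an error, which is the property that
makes it easy to describe.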

Mark

