[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Mark Brown mark at mercurylang.org
Thu Sep 5 17:29:07 AEST 2019


Hi Peter,

On Thu, Sep 5, 2019 at 4:34 PM Peter Wang <novalazy at gmail.com> wrote:
> [1] Note this goes against a Unicode "preference" to collapse a maximal
> invalid sequence of bytes that forms a prefix of a valid sequence into
> one REPLACEMENT CHARACTER. If we followed that guideline, each step of an
> iteration would have to collect up multiple bogus bytes instead of
> returning the bogus bytes one by one.

Following that guideline seems to me to be a good approach, and there
would still be other ways to access the underlying code units if needed
(including new predicates that return the char_or_code_unit type). What
are the drawbacks?
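
To make the two behaviours concrete, here is a rough sketch in Python
(not Mercury, and not the proposed library code). The names scan and
to_text and the collapse flag are made up for illustration, and a
str-or-bytes result stands in for something like char_or_code_unit.
With collapse=True each step gathers a maximal invalid prefix of a valid
sequence, using the byte ranges for well-formed UTF-8 from the Unicode
Standard; with collapse=False the bogus bytes come back one by one.

```python
from typing import Iterator, Tuple, Union

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def _expected_length(lead: int) -> int:
    """Length of the sequence started by `lead`, or 0 if it can never lead."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation bytes, C0, C1, F5..FF


def _second_byte_range(lead: int) -> Tuple[int, int]:
    """Allowed range for the byte right after `lead` (rules out overlongs,
    surrogates and values above U+10FFFF)."""
    if lead == 0xE0:
        return (0xA0, 0xBF)
    if lead == 0xED:
        return (0x80, 0x9F)
    if lead == 0xF0:
        return (0x90, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)
    return (0x80, 0xBF)


def scan(data: bytes, collapse: bool) -> Iterator[Union[str, bytes]]:
    """Yield decoded characters (str) or runs of bogus bytes (bytes).

    Hypothetical illustration only: `collapse` selects between one bogus
    byte per step and one maximal invalid prefix per step.
    """
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        length = _expected_length(lead)
        j = i + 1
        if length >= 2:
            lo, hi = _second_byte_range(lead)
            # Extend j over the bytes that still form a valid prefix.
            while j < n and j < i + length and lo <= data[j] <= hi:
                j += 1
                lo, hi = 0x80, 0xBF
        if length > 0 and j - i == length:
            yield data[i:j].decode("utf-8")  # complete, well-formed sequence
            i = j
        elif collapse:
            yield data[i:j]  # whole invalid prefix in one step
            i = j
        else:
            yield data[i:i + 1]  # a single bogus byte
            i += 1


def to_text(data: bytes, collapse: bool) -> str:
    """Render each bogus item as one REPLACEMENT CHARACTER."""
    return "".join(
        item if isinstance(item, str) else REPLACEMENT
        for item in scan(data, collapse)
    )
```

For the input b"a\xe2\x82b", the collapsing scan yields 'a', b'\xe2\x82',
'b', while the per-byte scan yields 'a', b'\xe2', b'\x82', 'b'. Either
way the underlying code units remain available to the caller.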

Mark

