[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Peter Wang novalazy at gmail.com
Mon Sep 9 11:26:21 AEST 2019


Hi Mark,

On Thu, 5 Sep 2019 17:29:07 +1000, Mark Brown <mark at mercurylang.org> wrote:
> Hi Peter,
> 
> On Thu, Sep 5, 2019 at 4:34 PM Peter Wang <novalazy at gmail.com> wrote:
> > [1] Note this goes against a Unicode "preference" to collapse a maximal
> > invalid sequence of bytes that forms a prefix of a valid sequence into
> > one REPLACEMENT CHARACTER. If we followed that guideline, each step of an
> > iteration would have to collect up multiple bogus bytes instead of
> > returning the bogus bytes one by one.
> 
> That seems to me to be a good approach, and there would be other ways
> to access the underlying code units if needed (including new
> predicates that return the char_or_code_unit type). What are the
> drawbacks?

Both would work.

I think consistently advancing by one bogus code unit is simpler
to understand and remember. I don't see much point to the maximal
prefix behaviour. It only makes a difference when a valid UTF-8 encoding
of a code point has been split, accidentally or deliberately. That is a
rarer situation than loading text in a different charset and treating it
as UTF-8, where the two behaviours rarely differ.
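
To make the difference concrete, here is a rough sketch in Python rather
than Mercury. It is only an illustration, not the proposed implementation:
the names `decode` and `maximal_prefix` are made up for this example, and
the byte ranges are those of Table 3-7 in the Unicode standard.

```python
REPLACEMENT = "\ufffd"


def _lead_info(lead):
    """Return (length, lo, hi) for a lead byte, where lo..hi is the allowed
    range of the second byte, or None if the byte cannot start a sequence."""
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF
    if lead == 0xED:
        return 3, 0x80, 0x9F
    if 0xE1 <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xF0:
        return 4, 0x90, 0xBF
    if lead == 0xF4:
        return 4, 0x80, 0x8F
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    return None


def maximal_prefix(data, i):
    """Return (n, complete): the n bytes at i are the longest prefix of a
    well-formed sequence (at least 1), and complete says whether they
    encode a whole code point."""
    lead = data[i]
    if lead < 0x80:
        return 1, True
    info = _lead_info(lead)
    if info is None:
        return 1, False
    length, lo, hi = info
    n = 1
    while n < length and i + n < len(data):
        b = data[i + n]
        ok = (lo <= b <= hi) if n == 1 else (0x80 <= b <= 0xBF)
        if not ok:
            break
        n += 1
    return n, n == length


def decode(data, collapse):
    """Decode UTF-8, replacing ill-formed input with U+FFFD.
    collapse=True replaces each maximal invalid prefix with one U+FFFD;
    collapse=False replaces each bogus byte individually."""
    out = []
    i = 0
    while i < len(data):
        n, complete = maximal_prefix(data, i)
        if complete:
            out.append(data[i:i + n].decode("utf-8"))
            i += n
        else:
            out.append(REPLACEMENT)
            i += n if collapse else 1
    return "".join(out)


split = "\u20ac".encode("utf-8")[:2] + b"A"    # first two bytes of U+20AC
latin1 = "\u00e9t\u00e9".encode("latin-1")     # Latin-1 text read as UTF-8

print(ascii(decode(split, collapse=True)))     # '\ufffdA'
print(ascii(decode(split, collapse=False)))    # '\ufffd\ufffdA'
print(ascii(decode(latin1, collapse=True)))    # '\ufffdt\ufffd'
print(ascii(decode(latin1, collapse=False)))   # '\ufffdt\ufffd'
```

Only the split encoding of U+20AC distinguishes the two policies; the
Latin-1 input gives the same result either way.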

The main reason to use the maximal prefix behaviour would be to align with
the W3C encoding standard. Unicode 12 actually seems pretty ambivalent
towards the practice [1,2] (thanks to Julien for pointing it out).

One interface I was considering was something like this
(ignore the name):

    :- pred index_next_cu(string::in, int::in, int::out, char::out,
        maybe(uint16)::out) is semidet.

where the last argument returns the code unit when no valid code point
is encoded at the given index. If the predicate could advance over
multiple bogus code units at once, the last argument would obviously
have to return a list or array instead. That seems an unnecessary
complication for an uncommon case of an uncommon case.
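
As a rough model of how such a predicate might behave, here is a sketch
in Python rather than Mercury. It makes several assumptions that are not
part of the proposal: the string is a sequence of UTF-8 bytes, the char
returned for a bogus code unit is U+FFFD, semidet failure is modelled by
returning None, and maybe(uint16) is modelled by an optional int.

```python
from typing import Optional, Tuple


def index_next_cu(data: bytes, index: int) -> Optional[Tuple[int, str, Optional[int]]]:
    """Return (next_index, char, bogus_code_unit), or None if index is
    out of range. bogus_code_unit is None when a valid code point is
    encoded at index; otherwise char is U+FFFD and exactly one code unit
    is consumed."""
    if not 0 <= index < len(data):
        return None
    # A UTF-8 encoded code point is 1 to 4 bytes long. The shortest slice
    # that decodes strictly is exactly one code point.
    for n in range(1, 5):
        try:
            ch = data[index:index + n].decode("utf-8")
        except UnicodeDecodeError:
            continue
        return index + n, ch, None
    # No valid code point here: advance by a single bogus code unit.
    return index + 1, "\ufffd", data[index]


data = b"a\xe2\x82\xac\xffz"       # 'a', U+20AC, a stray 0xFF byte, 'z'
print(index_next_cu(data, 1))      # (4, '€', None)
print(index_next_cu(data, 4))      # (5, '\ufffd', 255)
print(index_next_cu(data, 6))      # None
```

Because only one code unit is ever consumed on the error path, the last
result needs to hold at most a single value.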

I agree with your suggestion from our Friday meeting: a lot of programs
will work as intended if index_next returns U+FFFD. They wouldn't need
something like index_next_cu.

Peter

[1] https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf (p128)
[2] https://www.unicode.org/versions/Unicode12.0.0/ch05.pdf (p254)

