[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Julien Fischer jfischer at opturion.com
Thu Sep 5 17:29:53 AEST 2019


Hi Peter,

On Thu, 5 Sep 2019, Peter Wang wrote:

> [1] Note this goes against a Unicode "preference" to collapse a maximal
> invalid sequence of bytes that forms a prefix of a valid sequence into
> one REPLACEMENT CHARACTER. If we followed that guideline, each step of an
> iteration would have to collect up multiple bogus bytes instead of
> returning the bogus bytes one by one.

In relation to that, the Unicode standard (version 12, pg. 128)
explicitly says:

     The Unicode Standard does not require this practice for conformance

So, it is presumably not that strong a preference.  Prior to that, it
says that returning one replacement character per invalid byte is
conforming.
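
To make the difference concrete, here is a minimal Python sketch of the
"one replacement character per invalid byte" policy. It is only an
illustration, not the Mercury implementation; the names
decode_per_byte and _valid_length are made up for this example. For
comparison, CPython's built-in errors="replace" handler appears to
follow the "maximal subpart" practice instead, so a truncated
three-byte sequence yields one U+FFFD there rather than two.

```python
REPLACEMENT = "\ufffd"


def _valid_length(data: bytes, i: int) -> int:
    """Return the length (1-4) of the well-formed UTF-8 sequence
    starting at index i, or 0 if there is none."""
    for n in (1, 2, 3, 4):
        chunk = data[i:i + n]
        if len(chunk) < n:
            return 0  # ran off the end of the input
        try:
            chunk.decode("utf-8")  # strict: rejects ill-formed input
            return n
        except UnicodeDecodeError:
            continue
    return 0


def decode_per_byte(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD for every byte that does not
    belong to a well-formed sequence."""
    out = []
    i = 0
    while i < len(data):
        n = _valid_length(data, i)
        if n:
            out.append(data[i:i + n].decode("utf-8"))
            i += n
        else:
            out.append(REPLACEMENT)
            i += 1  # skip exactly one bogus byte
    return "".join(out)


# E2 82 is a prefix of the valid sequence E2 82 AC (EURO SIGN).
truncated = b"a\xe2\x82b"
per_byte = decode_per_byte(truncated)  # 'a\ufffd\ufffdb'
```

With this policy each step of an iteration only ever consumes a single
bogus byte, which is the simpler behaviour Peter describes.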

Julien.

