[m-rev.] for review/discussion: ill-formed sequences in Unicode strings
Julien Fischer
jfischer at opturion.com
Thu Sep 5 17:29:53 AEST 2019
Hi Peter,
On Thu, 5 Sep 2019, Peter Wang wrote:
> [1] Note this goes against a Unicode "preference" to collapse a maximal
> invalid sequence of bytes that forms a prefix of a valid sequence into
> one REPLACEMENT CHARACTER. If we followed that guideline, each step of an
> iteration would have to collect up multiple bogus bytes instead of
> returning the bogus bytes one by one.
In relation to that, the Unicode standard (version 12, pg. 128)
explicitly says:
    The Unicode Standard does not require this practice for conformance
So, it is presumably not that strong a preference. Prior to that, it
says that returning one replacement character per invalid byte is
conforming.
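To make the difference concrete, here is a rough sketch in Python (not Mercury, and not the code under review). The names decode_per_byte and decode_collapsed are made up for illustration. The first returns one replacement character per bogus byte; the second relies on CPython's built-in "replace" error handler, which as far as I know collapses a truncated prefix of a valid sequence into a single U+FFFD.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_per_byte(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD for every invalid byte.

    Hypothetical helper: at each offset, try to decode a well-formed
    sequence of 1 to 4 bytes; if none fits, replace a single byte and
    move on by one.
    """
    out = []
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):
            try:
                out.append(data[i:i + n].decode("utf-8"))
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # No well-formed sequence starts here: replace this byte only.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)


def decode_collapsed(data: bytes) -> str:
    """Decode UTF-8 with CPython's built-in 'replace' error handler.

    Assumption: the built-in decoder replaces each maximal ill-formed
    prefix with one U+FFFD, which is the practice the standard describes
    as a preference.
    """
    return data.decode("utf-8", errors="replace")


# 0xE2 0x82 is a truncated prefix of a three-byte sequence such as U+20AC.
sample = b"a\xe2\x82b"
per_byte = decode_per_byte(sample)
collapsed = decode_collapsed(sample)
```

For the sample above, the per-byte version yields two replacement characters between "a" and "b", while the built-in decoder yields one. Both agree on well-formed input.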
Julien.