[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Peter Wang novalazy at gmail.com
Wed Sep 11 18:00:17 AEST 2019


On Wed, 11 Sep 2019 17:40:39 +1000 (AEST), Julien Fischer <jfischer at opturion.com> wrote:
> 
> Ok, I'm fine with that.  What I do want is an identifiable subset of
> the string module (and other related modules) that will allow me to (a)
> (optionally) validate that a string is well-formed and (b) preserve that
> well-formedness.  I don't mind if predicates in the string module also
> handle ill-formed subsequences (unless there is a significant overhead
> in doing so).

Great.

We should definitely add a string.verify_encoding predicate soon.
How is that for a name?
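
As a rough illustration of what such a check does, here is a minimal
sketch in Python rather than Mercury. The name verify_utf8 is
hypothetical, and this is not the proposed string.verify_encoding
implementation. It leans on Python's strict UTF-8 decoder, which rejects
overlong forms, encoded surrogates and truncated sequences.

```python
def verify_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence.

    Hypothetical helper for illustration only; it is not part of any
    Mercury library.
    """
    try:
        # Strict decoding raises on the first ill-formed sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```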

> 
> What actually happens now if you have non-UTF-8 characters in a comment?

As of the lexer changes this year, it acts as if the file is truncated
at the first ill-formed sequence.
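
The effect described above can be sketched as follows, again in Python
and only as an illustration. The function name is hypothetical and this
is not the Mercury lexer's code. It keeps the well-formed prefix of the
input and drops everything from the first ill-formed sequence onwards.

```python
def prefix_before_ill_formed(data: bytes) -> str:
    """Decode `data` as UTF-8, stopping at the first ill-formed sequence.

    Hypothetical helper: mimics input that appears truncated at the
    first ill-formed byte sequence.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first ill-formed sequence,
        # so everything before it is well-formed and decodes cleanly.
        return data[:err.start].decode("utf-8")
```

With this sketch, a comment containing a stray 0xFF byte yields only the
text before that byte, and nothing after it is seen.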

Peter

