[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Julien Fischer jfischer at opturion.com
Wed Sep 11 20:40:08 AEST 2019



On Wed, 11 Sep 2019, Peter Wang wrote:

> On Wed, 11 Sep 2019 17:40:39 +1000 (AEST), Julien Fischer <jfischer at opturion.com> wrote:
>>
>> Ok, I'm fine with that.  What I do want is an identifiable subset of
>> the string module (and other related modules) that will allow me to (a)
>> (optionally) validate that a string is well-formed and (b) preserve that
>> well-formedness.  I don't mind if predicates in the string module also
>> handle ill-formed subsequences (unless there is a significant overhead
>> in doing so).

Actually, it would be simpler to assume preservation of well-formedness
and just document the handful of predicates that may not preserve it
(where that is not already done).

> We should definitely add a string.verify_encoding predicate soon.
> How is that for a name?

I would prefer something along the lines of has_valid_encoding or
is_well_formed.
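
As a rough illustration of what such a check would accept and reject,
here is a sketch in Python rather than Mercury. It is not the proposed
Mercury predicate: the function name is_well_formed_utf8 is hypothetical
and only mirrors the naming suggested above, and the check leans on
Python's strict UTF-8 decoder instead of a hand-written validator.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 code unit sequence.

    Hypothetical helper for illustration only. Python's strict decoder
    rejects stray bytes, truncated sequences, overlong encodings,
    encoded surrogates and values above U+10FFFF.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A few representative inputs.
samples = {
    "ascii": b"abc",
    "two-byte character": "h\u00e9llo".encode("utf-8"),
    "stray byte": b"ab\xffcd",
    "truncated sequence": b"\xe2\x82",
    "overlong encoding": b"\xc0\x80",
    "encoded surrogate": b"\xed\xa0\x80",
}

for label, raw in samples.items():
    print(f"{label}: {is_well_formed_utf8(raw)}")
```

The point of the sketch is only that the test is a pure yes/no question
about the code unit sequence, which is why a name phrased as a property
reads more naturally than one phrased as an action.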

>> What actually happens now if you have non-UTF-8 characters in a comment?
>
> As of the lexer changes this year, it acts as if the file is truncated
> at the first ill-formed sequence.

Ok, so when we have an option for read_file_as_string that causes it
to replace invalid code units with REPLACEMENT CHARACTER we ought to
be able to improve on that.
(We should probably restrict quoted names in Mercury so that they
cannot contain REPLACEMENT CHARACTER to begin with.)
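
To make the difference between the two behaviours concrete, here is a
small sketch in Python (not Mercury, and not the actual lexer or
read_file_as_string code). Both function names are hypothetical. The
first mimics acting as if the input ends at the first ill-formed
sequence; the second substitutes U+FFFD REPLACEMENT CHARACTER using
Python's built-in "replace" error handler, whose substitution policy
may differ in detail from whatever Mercury ends up doing.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def truncate_at_first_ill_formed(data: bytes) -> str:
    """Decode `data`, stopping at the first ill-formed sequence.

    Hypothetical helper: everything from the first invalid byte
    onwards is silently dropped.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte, so the
        # prefix before it is guaranteed to be well-formed.
        return data[:err.start].decode("utf-8")


def replace_ill_formed(data: bytes) -> str:
    """Decode `data`, substituting U+FFFD for ill-formed sequences."""
    return data.decode("utf-8", errors="replace")


# A comment containing a Latin-1 encoded e-acute (byte 0xE9), which is
# not valid UTF-8 here, followed by a line of ordinary text.
source = b"% caf\xe9 comment\nx = 1.\n"

print(repr(truncate_at_first_ill_formed(source)))
print(repr(replace_ill_formed(source)))
```

With truncation, the line after the comment is lost along with the rest
of the input; with replacement, only the offending byte is affected and
the remaining text survives.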

Julien.

