[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Mark Brown mark at mercurylang.org
Thu Sep 12 00:46:55 AEST 2019


Hi,

On Wed, Sep 11, 2019 at 8:40 PM Julien Fischer <jfischer at opturion.com> wrote:
> On Wed, 11 Sep 2019, Peter Wang wrote:
> >> What actually happens now if you have non-UTF-8 characters in a comment?
> >
> > As of the lexer changes this year, it acts as if the file is truncated
> > at the first ill-formed sequence.
>
> Ok, so when we have an option for read_file_as_string that causes it
> to replace invalid code units with REPLACEMENT CHARACTER we ought to
> be able to improve on that.

I agree with Peter that this is best done at a higher level.
library/lexer.m currently uses string.unsafe_index_next. If this is
changed to use the replacement character, I think the improved
behaviour will follow automatically.
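
To make the difference concrete, here is a minimal sketch in Python
(not Mercury, and not the actual code in library/lexer.m). The function
names are hypothetical and only illustrate the two behaviours discussed
in this thread: treating the input as truncated at the first ill-formed
sequence, versus substituting REPLACEMENT CHARACTER (U+FFFD) and
carrying on.

```python
REPLACEMENT_CHARACTER = "\ufffd"  # U+FFFD


def decode_truncating(data: bytes) -> str:
    """Decode UTF-8, stopping at the first ill-formed sequence.

    This mimics the "file appears truncated" behaviour described above.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first ill-formed sequence;
        # everything before it is well-formed UTF-8.
        return data[:err.start].decode("utf-8")


def decode_replacing(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each ill-formed sequence."""
    return data.decode("utf-8", errors="replace")


# A comment containing a Latin-1 byte (0xE9), which is not valid UTF-8 here.
source = b"% caf\xe9 comment\n:- module m.\n"

truncated = decode_truncating(source)
replaced = decode_replacing(source)

print(repr(truncated))
print(repr(replaced))
```

With the truncating decoder everything after the bad byte is lost,
including the module declaration on the next line. With the replacing
decoder only the bad byte is affected, so a lexer reading the result
would still see the rest of the comment and the code that follows it.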

Mark

