[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Julien Fischer jfischer at opturion.com
Wed Sep 11 17:40:39 AEST 2019


On Wed, 11 Sep 2019, Peter Wang wrote:

> On Wed, 11 Sep 2019 14:06:51 +1000 (AEST), Julien Fischer <jfischer at opturion.com> wrote:

...

>>> On the other hand, allowing ill-formed sequences makes the string type
>>> and its operations more useful. It should be possible (and not hard)
>>> to work on the ASCII or well-formed subparts of a string and leave the
>>> ill-formed subparts intact.
>>
>> Are you saying that is the only behaviour the standard library should
>> support or *a* behaviour?  I'm fine with the latter, not the former.
>
> I'm saying we should NOT make any guarantee that Mercury strings only
> contain well-formed code unit sequences.

Ok, I'm fine with that.  What I do want is an identifiable subset of
the string module (and other related modules) that will allow me to (a)
(optionally) validate that a string is well-formed and (b) preserve that
well-formedness.  I don't mind if predicates in the string module also
handle ill-formed subsequences (unless there is a significant overhead
in doing so).
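
To make point (a) concrete, here is a minimal sketch of what "validate that
a string is well-formed" could mean for UTF-8. It is written in Python
rather than Mercury, purely as an illustration; the function name
is_well_formed_utf8 is hypothetical and is not part of any Mercury library.
It relies on Python's strict UTF-8 decoder, which rejects stray bytes,
truncated sequences, overlong encodings and encoded surrogates.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 code unit sequence.

    Hypothetical helper for illustration only: it delegates the check to
    Python's strict UTF-8 decoder and discards the decoded result.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

An operation that preserves well-formedness, point (b), would then be one
whose result still passes this check whenever its inputs do.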

>> That's perfectly consistent with the way Mercury is defined!  That said,
>> in the specific case of non-UTF-8 characters in comments, the compiler
>> should probably recover a little more gracefully than it currently does.
>
> All right, not the best example since we defined Mercury source files
> to contain UTF-8. But rejecting non-UTF-8 chars is better done at the
> parser level (as it is now), and that requires strings to be able to
> hold the non-UTF-8 sequences so that the parser can see them.
>
> (In this particular case, io.read_file_as_string et al could
> replace the non-UTF-8 sequences with replacement chars
> but other programs would want the original bytes available.)
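
The trade-off described in the quoted paragraph above can be sketched as
follows. This is an illustration in Python, not Mercury, and it says nothing
about how io.read_file_as_string actually behaves; the three function names
are hypothetical. It contrasts a lossy decode, which substitutes U+FFFD for
each ill-formed sequence, with a decode that keeps enough information to
recover the original bytes (Python's "surrogateescape" error handler).

```python
def decode_lossy(data: bytes) -> str:
    """Decode as UTF-8, replacing ill-formed sequences with U+FFFD.

    The original bytes cannot be recovered from the result.
    """
    return data.decode("utf-8", errors="replace")


def decode_preserving(data: bytes) -> str:
    """Decode as UTF-8, smuggling each undecodable byte through as a
    lone surrogate so that it can be restored later."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_preserving(text: str) -> bytes:
    """Inverse of decode_preserving: restore the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# A comment line containing a Latin-1 byte (0xE9), which is not valid UTF-8.
raw = b"% caf\xe9 comment\n"
lossy = decode_lossy(raw)
round_tripped = encode_preserving(decode_preserving(raw))
```

Here `lossy` contains a replacement character where the stray byte was,
while `round_tripped` is byte-for-byte equal to `raw`, which is the property
a program needing the original bytes would rely on.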

What actually happens now if you have non-UTF-8 characters in a comment?

Julien.

