[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Julien Fischer jfischer at opturion.com
Wed Sep 11 11:24:46 AEST 2019


Hi Peter,

Some thoughts on this.

On Thu, 5 Sep 2019, Peter Wang wrote:

> Back when I introduced Unicode string support to Mercury, I left the
> question of what to do about strings containing invalid encodings for
> future work.
>
> A tempting option is to say that Mercury strings MUST always contain
> valid UTF-8 or UTF-16.

For most predicates that return strings, that's precisely what should
occur.

> Any foreign code that creates a string would need to validate it; good luck
> with that. The compiler could generate code to automatically validate any
> string returned by a foreign_proc, but it is still inefficient. And then
> what, throw an exception?

Foreign procs that wrap code that is known to return well-encoded strings
should not have to carry out such checks.  Ideally, we would add an additional
foreign code attribute that allows the programmer to specify whether the
Mercury compiler inserts such checks on string output arguments of
foreign_procs.

As to what should be done when an encoding error is detected, there seem
to be two reasonable behaviours:

   1. Throw an exception.
   2. Replace the invalid part of the string with a replacement character.

Which of these a programmer is going to want is application-specific, so as
far as the standard library is concerned we probably need to offer both.
(Decoders for charsets in C# and Java offer the same choice.)
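As a point of comparison, Python's decoders expose the same pair of
behaviours through their error handlers; a small sketch (not Mercury, just
an illustration of the two options):

```python
data = b"abc\xff\xfedef"  # bytes containing two invalid UTF-8 code units

# Behaviour 1: throw an exception at the first invalid code unit.
try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError as e:
    print("invalid encoding:", e.reason)

# Behaviour 2: substitute U+FFFD REPLACEMENT CHARACTER and carry on.
print(data.decode("utf-8", errors="replace"))  # abc\ufffd\ufffddef
```

C#'s DecoderFallback and Java's CodingErrorAction make the equivalent
strict/replace choice at decoder-construction time rather than per call.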

As to whether the replacement option substitutes one replacement character
per maximal invalid sequence or one per invalid code unit, I have a weak
preference for the latter.
(As mentioned elsewhere in this thread, the choice doesn't affect Unicode
conformance.)

> It now seems to me, ill-formed sequences should not be a big deal.

I'll remember that ;-)

> UTF-16 has it easier than UTF-8. An unpaired surrogate code point is a
> code point nonetheless, so it fits in a Mercury `char'. We can iterate
> over or count code points by considering each unpaired surrogate code
> point as one code point. (*Paired* surrogate code points are the special
> case.)
>
> UTF-8 is more difficult. A byte that doesn't participate in a valid
> encoding of a code point (a "bogus byte") is not itself a code point.
> However, when iterating over a string or counting code points, we can
> treat each bogus byte AS IF it were a code point in and of itself[1]. Or:
> each valid code point has two fence posts either side of it, and each
> bogus byte also has two fence posts either side of it. The only problem
> is that in interfaces where we have used `char', we cannot return the
> bogus byte as a `char', so we would need new interfaces.
>
> It would be possible to map bogus bytes to an unused code point range
> (thus stick it in a char), but that's a big step to take.

I think we are going to require a new set of interfaces for reading
characters.  (And to clarify the behaviour of the existing ones.)
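To make the shape of such an interface concrete, here is a sketch in Python
of an iterator over UTF-8 bytes that yields each valid code point as a
character and each bogus byte as a raw byte, following the fence-post model
above.  The name and tagged-pair representation are purely illustrative,
not a proposal for the actual Mercury API:

```python
def code_point_items(buf: bytes):
    """Yield ("char", c) for each validly encoded code point and
    ("bogus", b) for each byte that does not participate in one,
    treating each bogus byte as a one-unit item."""
    i, n = 0, len(buf)
    while i < n:
        b = buf[i]
        if b < 0x80:                          # ASCII: always a code point
            yield ("char", chr(b)); i += 1; continue
        # Expected sequence length from the lead byte; lead bytes that can
        # only start overlong or out-of-range sequences are bogus outright.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            yield ("bogus", b); i += 1; continue
        try:
            # decode() rejects truncated sequences, bad continuation
            # bytes, overlong forms and encoded surrogates.
            c = buf[i:i + length].decode("utf-8")
            yield ("char", c); i += length
        except UnicodeDecodeError:
            yield ("bogus", b); i += 1        # lead byte becomes one item
```

For example, `b"a\xc3\xa9\xffb"` yields the characters `a`, `é`, the bogus
byte 0xFF, then `b`; a truncated sequence such as `b"\xe2\x82"` yields two
bogus bytes.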

Julien.

