[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Peter Wang novalazy at gmail.com
Wed Sep 11 12:50:21 AEST 2019


On Wed, 11 Sep 2019 11:24:46 +1000 (AEST), Julien Fischer <jfischer at opturion.com> wrote:
> 
> Hi Peter,
> 
> Some thoughts on this.
> 
> On Thu, 5 Sep 2019, Peter Wang wrote:
> 
> > Back when I introduced Unicode string support to Mercury, I left the
> > question of what to do about strings containing invalid encodings for
> > future work.
> >
> > A tempting option is to say that Mercury strings MUST always contain
> > valid UTF-8 or UTF-16.
> 
> For most predicates that return strings, that's precisely what should
> occur.
> 
> > Any foreign code that creates a string would need to validate it; good luck
> > with that. The compiler could generate code to automatically validate any
> > string returned by a foreign_proc, but it is still inefficient. And then
> > what, throw an exception?
> 
> Foreign procs that wrap code that is known to return well encoded strings
> should not have to carry out such checks.  Ideally, we would add an additional
> foreign code attribute that allows the programmer to specify whether the
> Mercury compiler inserts such checks on string output arguments for
> foreign_procs.
> 
> As to what should be done when an encoding error is detected, there seem
> to be two reasonable behaviours:
> 
>    1. Throw an exception.
>    2. Replace the invalid part of the string with a replacement character.
> 
> Which one of these a programmer is going to want is application specific, so as
> far as the standard library is concerned we probably need to offer both.
> (Decoders for charsets in C# and Java offer the same choice.)
> 
> As to whether the replacement option collapses maximal invalid sequences or
> just replaces each invalid code unit, I have a weak preference for the latter.
> (As mentioned elsewhere in this thread, the choice doesn't affect Unicode
> conformance.)
> 
> > It now seems to me, ill-formed sequences should not be a big deal.
> 
> I'll remember that ;-)
> 

It seems we disagree. I don't want to add another foreign_proc annotation if
we can help it, and I don't want an (implicit) source of exceptions if
we can help it.

What would guaranteeing string validity buy you?
My opinion is: not a lot.

  - The mental model is a bit simpler, though you need to worry more
    about the boundary between Mercury and foreign code.
  - Maybe some code can go a bit faster.
  - We can be sure to avoid writing out "garbage" (which may not be
    garbage, just text in another encoding), but downstream processors
    need to be prepared for ill-formed sequences anyway.

On the other hand, allowing ill-formed sequences makes the string type
and its operations more useful. It should be possible (and not hard)
to work on the ASCII or well-formed subparts of a string and leave the
ill-formed subparts intact. It's rather annoying when programs choke
because the language or some API insisted on Unicode validation,
when it is actually unnecessary to the operation of the program.
(An example is the Mercury compiler choking on non-UTF-8 characters
in comments ;)
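
To make that concrete, here is a rough sketch in Python rather than
Mercury. It is not library code: the input bytes and the function names
are made up for the illustration. It shows an operation that only touches
ASCII and passes an ill-formed byte through untouched, next to the two
validating behaviours discussed above (raise, or substitute U+FFFD).

```python
# Hypothetical input: ASCII text containing one stray Latin-1 byte (0xE9),
# which is not a well-formed UTF-8 sequence on its own.
raw = b"caf\xe9 menu"


def upper_ascii(data: bytes) -> bytes:
    """Upper-case ASCII letters only; every other byte passes through as-is."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


def decode_strict(data: bytes) -> str:
    """Validating behaviour 1: raise on the first ill-formed sequence."""
    return data.decode("utf-8", errors="strict")


def decode_replace(data: bytes) -> str:
    """Validating behaviour 2: substitute U+FFFD for ill-formed sequences."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The ASCII-only operation works and preserves the ill-formed byte.
    print(upper_ascii(raw))
    # Replacement succeeds but the original byte is lost.
    print(decode_replace(raw).encode("utf-8"))
    # Strict validation refuses the whole string.
    try:
        decode_strict(raw)
    except UnicodeDecodeError as err:
        print("strict decode failed:", err.reason)
```

The first call is the kind of operation that does not need validation at
all; the other two are what a "strings MUST be valid" rule would force at
the boundary.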

Peter

