[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Peter Wang novalazy at gmail.com
Wed Sep 11 16:15:23 AEST 2019


On Wed, 11 Sep 2019 14:06:51 +1000 (AEST), Julien Fischer <jfischer at opturion.com> wrote:
> 
> Hi Peter,
> 
> On Wed, 11 Sep 2019, Peter Wang wrote:
> 
> > It seems we disagree. I don't want to add another foreign_proc annotation if
> > we can help it, and I don't want an (implicit) source of exceptions if
> > we can help it.
> >
> > What would guaranteeing string validity buy you?
> > My opinion is: not a lot.
> >
> >  - The mental model is a bit simpler, though you need to worry more
> >    about the boundary between Mercury and foreign code.
> >  - Maybe some code can go a bit faster.
> >  - We can be sure to avoid writing out "garbage" (which may not be
> >    garbage, just text in another encoding), but downstream processors
> >    need to be prepared for ill-formed sequences anyway.
> >
> > On the other hand, allowing ill-formed sequences makes the string type
> > and its operations more useful. It should be possible (and not hard)
> > to work on the ASCII or well-formed subparts of a string and leave the
> > ill-formed subparts intact.
> 
> Are you saying that is the only behaviour the standard library should
> support or *a* behaviour?  I'm fine with the latter, not the former.

I'm saying we should NOT make any guarantee that Mercury strings
only contain well-formed code unit sequences. Without that guarantee,
predicates/functions that take strings as inputs must be prepared to deal
with ill-formed code unit sequences in some way (without data loss when
possible). Without that guarantee, there is no need to validate every
string on creation.
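
A rough illustration of "deal with ill-formed sequences without data
loss", written in Python rather than Mercury purely for the sake of a
runnable sketch. The function name map_well_formed is hypothetical and
not part of any Mercury or Python library. It applies a transformation
to the well-formed UTF-8 runs of a byte string and copies the ill-formed
bytes through untouched:

```python
from typing import Callable


def map_well_formed(data: bytes, f: Callable[[str], str]) -> bytes:
    """Apply f to each well-formed UTF-8 run of data.

    Bytes that do not form a valid UTF-8 sequence are copied to the
    output unchanged, so no information is lost.
    (Hypothetical helper, for illustration only.)
    """
    out = bytearray()
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that is left.
            out += f(data[pos:].decode("utf-8")).encode("utf-8")
            break
        except UnicodeDecodeError as e:
            # e.start and e.end are offsets into the slice data[pos:].
            good = data[pos:pos + e.start].decode("utf-8")
            out += f(good).encode("utf-8")
            # Pass the offending bytes through as they are.
            out += data[pos + e.start:pos + e.end]
            pos += e.end
    return bytes(out)


# The two bytes 0xFF 0xFE are not valid UTF-8; they survive the call.
result = map_well_formed(b"caf\xc3\xa9 \xff\xfe ok", str.upper)
```

The sketch re-slices the input on every error, which is quadratic in the
worst case; that is acceptable for an illustration but not for a library
implementation.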

> > It's rather annoying when programs choke because the language or some
> > API insisted on Unicode validation, when it is actually unnecessary to
> > the operation of the program.
> > (An example is the Mercury compiler choking on non-UTF-8 characters
> > in comments ;)
> 
> That's perfectly consistent with the way Mercury is defined!  That said,
> in the specific case of non-UTF-8 characters in comments, the compiler
> should probably recover a little more gracefully than it currently does.

All right, not the best example since we defined Mercury source files
to contain UTF-8. But rejecting non-UTF-8 chars is better done at the
parser level (as it is now), and that requires strings to be able to
hold the non-UTF-8 sequences so that the parser can see them.

(In this particular case, io.read_file_as_string et al could
replace the non-UTF-8 sequences with replacement characters (U+FFFD),
but other programs would want the original bytes available.)
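
The trade-off can be sketched as follows, again in Python rather than
Mercury. The function names are hypothetical; the "replace" and
"surrogateescape" error handlers are standard Python and stand in here
for whatever mechanism a Mercury library might use. Replacing ill-formed
bytes is lossy, whereas a representation that keeps them allows the
original bytes to be recovered:

```python
def decode_lossy(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for ill-formed sequences."""
    return raw.decode("utf-8", errors="replace")


def decode_preserving(raw: bytes) -> str:
    """Decode as UTF-8, keeping each ill-formed byte as a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_back(text: str) -> bytes:
    """Inverse of decode_preserving: restores the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# 0xE9 is "e acute" in Latin-1 but is ill-formed here as UTF-8.
raw = b"% comment: caf\xe9\n"

lossy = decode_lossy(raw)
assert "\ufffd" in lossy                 # the original byte is gone
assert lossy.encode("utf-8") != raw      # and cannot be recovered

kept = decode_preserving(raw)
assert encode_back(kept) == raw          # round trip is exact
```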

Peter

