[m-rev.] for review/discussion: ill-formed sequences in Unicode strings
Peter Wang
novalazy at gmail.com
Wed Sep 11 16:15:23 AEST 2019
On Wed, 11 Sep 2019 14:06:51 +1000 (AEST), Julien Fischer <jfischer at opturion.com> wrote:
>
> Hi Peter,
>
> On Wed, 11 Sep 2019, Peter Wang wrote:
>
> > It seems we disagree. I don't want to add another foreign_proc
> > annotation if we can help it, and I don't want an (implicit) source of
> > exceptions if we can help it.
> >
> > What would guaranteeing string validity buy you?
> > My opinion is: not a lot.
> >
> > - The mental model is a bit simpler, though you need to worry more
> > about the boundary between Mercury and foreign code.
> > - Maybe some code can go a bit faster.
> > - We can be sure to avoid writing out "garbage" (which may not be
> > garbage, just text in another encoding), but downstream processors
> > need to be prepared for ill-formed sequences anyway.
> >
> > On the other hand, allowing ill-formed sequences makes the string type
> > and its operations more useful. It should be possible (and not hard)
> > to work on the ASCII or well-formed subparts of a string and leave the
> > ill-formed subparts intact.
>
> Are you saying that is the only behaviour the standard library should
> support or *a* behaviour? I'm fine with the latter, not the former.

I'm saying we should NOT make any guarantee that Mercury strings
only contain well-formed code unit sequences. Without that guarantee,
predicates/functions that take strings as inputs must be prepared to deal
with ill-formed code unit sequences in some way (without data loss when
possible). Without that guarantee, there is no need to validate every
string on creation.

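As a rough illustration of "deal with ill-formed sequences without data loss", here is a minimal sketch in Python rather than Mercury. It is not a proposal for the Mercury library API. The function name upper_preserving_ill_formed is made up for this example. It relies on Python's "surrogateescape" error handler, which maps each undecodable byte to a lone surrogate on decoding and restores the original byte on encoding.

```python
def upper_preserving_ill_formed(data: bytes) -> bytes:
    """Uppercase the well-formed text in `data`, leaving ill-formed bytes intact.

    Hypothetical helper for illustration only.
    """
    # Each byte that is not part of a well-formed UTF-8 sequence is decoded
    # to a lone surrogate in the range U+DC80..U+DCFF instead of raising.
    text = data.decode("utf-8", errors="surrogateescape")
    # Lone surrogates have no case mapping, so upper() passes them through.
    upper = text.upper()
    # Encoding with the same handler turns the surrogates back into the
    # original bytes, so the ill-formed parts round-trip unchanged.
    return upper.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"caf\xc3\xa9 \xff\xfe ok"  # well-formed text around two stray bytes
    print(upper_preserving_ill_formed(raw))
```

The operation works on the well-formed parts and never needs to raise an exception or validate its input up front, which is the trade-off argued for above.
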
> > It's rather annoying when programs choke because the language or some
> > API insisted on Unicode validation, when it is actually unnecessary to
> > the operation of the program.
> > (An example is the Mercury compiler choking on non-UTF-8 characters
> > in comments ;)
>
> That's perfectly consistent with the way Mercury is defined! That said,
> in the specific case of non-UTF-8 characters in comments, the compiler
> should probably recover a little more gracefully than it currently does.

All right, not the best example since we defined Mercury source files
to contain UTF-8. But rejecting non-UTF-8 chars is better done at the
parser level (as it is now), and that requires strings to be able to
hold the non-UTF-8 sequences so that the parser can see them.
(In this particular case, io.read_file_as_string et al could replace
the non-UTF-8 sequences with replacement chars, but other programs
would want the original bytes available.)

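To make the difference between the two policies concrete, here is a small sketch, again in Python purely for illustration. The names read_lossy and first_ill_formed_offset are hypothetical and do not describe how io.read_file_as_string or the Mercury parser actually behave. The first function substitutes U+FFFD at read time; the second keeps the bytes and only reports where the first ill-formed sequence starts, as a parser might.

```python
from typing import Optional


def read_lossy(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for ill-formed sequences.

    The original bytes cannot be recovered from the result.
    """
    return data.decode("utf-8", errors="replace")


def first_ill_formed_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first ill-formed sequence, or None.

    This is the kind of check a parser can make if the bytes reach it intact.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on ill-formed input
    except UnicodeDecodeError as err:
        return err.start  # offset of the first offending byte
    return None


if __name__ == "__main__":
    raw = b"ok \xff bad"
    print(repr(read_lossy(raw)))
    print(first_ill_formed_offset(raw))
```

With the lossy reader, inputs that differ only in their ill-formed bytes become indistinguishable, so a program that needs the original bytes has nothing left to work with.
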
Peter