[m-rev.] for review/discussion: ill-formed sequences in Unicode strings
Julien Fischer
jfischer at opturion.com
Wed Sep 11 14:06:51 AEST 2019
Hi Peter,
On Wed, 11 Sep 2019, Peter Wang wrote:
> It seems we disagree. I don't want to add another foreign_proc annotation if
> we can help it, and I don't want an (implicit) source of exceptions if
> we can help it.
>
> What would guaranteeing string validity buy you?
> My opinion is: not a lot.
>
> - The mental model is a bit simpler, though you need to worry more
> about the boundary between Mercury and foreign code.
> - Maybe some code can go a bit faster.
> - We can be sure to avoid writing out "garbage" (which may not be
> garbage, just text in another encoding), but downstream processors
> need to be prepared for ill-formed sequences anyway.
>
> On the other hand, allowing ill-formed sequences makes the string type
> and its operations more useful. It should be possible (and not hard)
> to work on the ASCII or well-formed subparts of a string and leave the
> ill-formed subparts intact.
Are you saying that is the only behaviour the standard library should
support or *a* behaviour? I'm fine with the latter, not the former.
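
To make the distinction between "the only behaviour" and "a behaviour" concrete, here is a minimal sketch of the two options side by side. It is written in Python rather than Mercury, and it is not any existing or proposed Mercury standard library API; the function names upper_ascii_lenient and upper_ascii_strict are hypothetical. The lenient variant works on the ASCII parts of the input and passes ill-formed bytes through unchanged; the strict variant performs the same operation but rejects ill-formed input with an exception.

```python
def upper_ascii_lenient(raw: bytes) -> bytes:
    """Uppercase ASCII letters; leave everything else, including
    ill-formed UTF-8 bytes, exactly as it was."""
    # "surrogateescape" maps each undecodable byte to a lone surrogate
    # (U+DC80..U+DCFF) on decode and restores the same byte on encode.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = "".join(c.upper() if c.isascii() else c for c in text)
    return out.encode("utf-8", errors="surrogateescape")


def upper_ascii_strict(raw: bytes) -> bytes:
    """Same operation, but ill-formed input raises UnicodeDecodeError."""
    text = raw.decode("utf-8")  # errors="strict" is the default
    out = "".join(c.upper() if c.isascii() else c for c in text)
    return out.encode("utf-8")


if __name__ == "__main__":
    data = b"caf\xc3\xa9 \xff ok"  # 0xFF can never appear in valid UTF-8
    print(upper_ascii_lenient(data))
    try:
        upper_ascii_strict(data)
    except UnicodeDecodeError as exc:
        print("strict variant rejected the input:", exc.reason)
```

A library that offers both variants lets the caller choose; a library that offers only the first gives callers who want validation no way to ask for it.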
> It's rather annoying when programs choke because the language or some
> API insisted on Unicode validation, when it is actually unnecessary to
> the operation of the program.
> (An example is the Mercury compiler choking on non-UTF-8 characters
> in comments ;)
That's perfectly consistent with the way Mercury is defined! That said,
in the specific case of non-UTF-8 characters in comments, the compiler
should probably recover a little more gracefully than it currently does.
Julien.