[m-rev.] for review/discussion: ill-formed sequences in Unicode strings

Julien Fischer jfischer at opturion.com
Wed Sep 11 14:06:51 AEST 2019


Hi Peter,

On Wed, 11 Sep 2019, Peter Wang wrote:

> It seems we disagree. I don't want to add another foreign_proc annotation if
> we can help it, and I don't want an (implicit) source of exceptions if
> we can help it.
>
> What would guaranteeing string validity buy you?
> My opinion is: not a lot.
>
>  - The mental model is a bit simpler, though you need to worry more
>    about the boundary between Mercury and foreign code.
>  - Maybe some code can go a bit faster.
>  - We can be sure to avoid writing out "garbage" (which may not be
>    garbage, just text in another encoding), but downstream processors
>    need to be prepared for ill-formed sequences anyway.
>
> On the other hand, allowing ill-formed sequences makes the string type
> and its operations more useful. It should be possible (and not hard)
> to work on the ASCII or well-formed subparts of a string and leave the
> ill-formed subparts intact.

Are you saying that is the only behaviour the standard library should
support or *a* behaviour?  I'm fine with the latter, not the former.
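
To make the two behaviours concrete, here is a minimal sketch in Python
rather than Mercury. The function names are hypothetical and nothing here
reflects how the Mercury standard library is or should be implemented. It
assumes strings are held as UTF-8 bytes. The lenient version works only on
the ASCII parts and leaves everything else, ill-formed or not, intact; the
strict version validates first and raises an exception.

```python
def upper_ascii_lenient(data: bytes) -> bytes:
    """Upper-case ASCII letters; copy every other byte through unchanged.

    In UTF-8, bytes in the range 0x00-0x7F only ever encode ASCII
    characters, so this never alters a multi-byte sequence, whether that
    sequence is well-formed or not.
    """
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


def upper_ascii_strict(data: bytes) -> bytes:
    """Same operation, but reject ill-formed UTF-8 before doing any work."""
    data.decode("utf-8")  # raises UnicodeDecodeError on ill-formed input
    return upper_ascii_lenient(data)


if __name__ == "__main__":
    # 0xE9 on its own is ill-formed UTF-8; 0xC3 0xAF is a well-formed "ï".
    mixed = b"caf\xe9 na\xc3\xafve"
    print(upper_ascii_lenient(mixed))
    try:
        upper_ascii_strict(mixed)
    except UnicodeDecodeError as err:
        print("strict variant rejected the input:", err.reason)
```

Offering both side by side corresponds to the "*a* behaviour" reading: the
caller chooses whether ill-formed input is passed through or rejected.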

> It's rather annoying when programs choke because the language or some
> API insisted on Unicode validation, when it is actually unnecessary to
> the operation of the program.
> (An example is the Mercury compiler choking on non-UTF-8 characters
> in comments ;)

That's perfectly consistent with the way Mercury is defined!  That said,
in the specific case of non-UTF-8 characters in comments, the compiler
should probably recover a little more gracefully than it currently does.

Julien.

