[m-dev.] proposed unicode support for Mercury

Peter Wang novalazy at gmail.com
Fri Mar 4 13:53:12 AEDT 2011


On 2011-03-04, Peter Moulder <peter.moulder at monash.edu> wrote:
> On Fri, Mar 04, 2011 at 11:47:35AM +1100, Peter Wang wrote:
> 
> > Since there are no objections, now seems as good a time as any.
> > 11.01 betas are available for users who don't want to be affected
> > for now.
> > 
> > Does anyone want to do a detailed review of the patch, or should I just
> > commit?  Of course I will bootcheck.  Any particular settings I should
> > check?
> 
> I do have reservations about the proposal in general, as well as comments
> on the patch itself; but for some reason my message didn't come through
> to the mailing list.  (Maybe something to do with Monash's continuing
> efforts to mung mail headers.)  I'm sending this to Peter Wang as well;
> please forward to the list if necessary.
> 
> How is it proposed for Mercury programs to deal with bytes as distinct
> from unicode characters?

The `bitmap' type or another dedicated type.

`string' is already not the right choice for holding bytes.  The string
module tries quite hard to disallow NUL bytes in C grades, and using
strings to hold bytes is inefficient in Java and C# grades.

> 
> What will be the approach to strings that aren't valid utf-n strings?

My patch was not quite decided on this.  Perhaps we should throw an
exception when we come across an illegal sequence, or try our best to
muddle on, e.g. returning U+FFFD and skipping to the next code point.
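
To make the two options concrete, here is a rough sketch in Python rather
than Mercury.  It leans on Python's built-in UTF-8 decoder, whose "strict"
and "replace" error handlers happen to match the two policies; the helper
names and the sample input are made up for illustration and are not from
the patch.

```python
# Hypothetical input: "ab", then the byte 0xFF (never legal in UTF-8), then "cd".
data = b"ab\xffcd"


def decode_strict(raw: bytes) -> str:
    """Policy 1: raise an exception on the first illegal sequence."""
    return raw.decode("utf-8", errors="strict")


def decode_lenient(raw: bytes) -> str:
    """Policy 2: substitute U+FFFD for the bad sequence and carry on."""
    return raw.decode("utf-8", errors="replace")


try:
    decode_strict(data)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

lenient = decode_lenient(data)
```

The strict policy reports the problem to the caller; the lenient one keeps
the surrounding text readable but loses the offending bytes.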

> 
> Will there be strings that the Mercury runtime can assume to be validly
> encoded (so that extracting code points can be quick) ?

You mean, a separate subtype of strings?  I don't think it's necessary.
Do you have a situation in mind?

Since the interesting characters are almost always in the ASCII range,
it is possible to use string.unsafe_index_code_unit as an optimisation
where necessary, regardless of whether the whole string is valid.
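
A sketch of why this works, assuming a UTF-8 representation: code units
below 0x80 never occur inside a multi-byte sequence, so scanning for an
ASCII character can look at code units directly without validating the
rest of the string.  This is Python, not Mercury, and find_ascii_unit is a
made-up name.

```python
def find_ascii_unit(units: bytes, target: str, start: int = 0) -> int:
    """Return the code unit offset of an ASCII character, or -1 if absent.

    The scan compares raw code units only, so it does not care whether
    the non-ASCII parts of the input are well-formed UTF-8.
    """
    unit = ord(target)
    if unit >= 0x80:
        raise ValueError("target must be an ASCII character")
    for i in range(start, len(units)):
        if units[i] == unit:
            return i
    return -1


valid = "na\u00efve,caf\u00e9".encode("utf-8")   # well-formed UTF-8
broken = b"\xff" + valid                          # same text after a stray byte

comma_in_valid = find_ascii_unit(valid, ",")
comma_in_broken = find_ascii_unit(broken, ",")
```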

> Why is a char forbidden to be >0x10ffff but allowed to be a surrogate
> code point (which can't actually be written in valid utf-n)?

I think the answer is that surrogate code points are still code points,
so a Mercury char should be able to represent them.  If we disallow
surrogates then we should also disallow U+FFFE and U+FFFF (and more?)

> Incidentally, there is a bug in the to-utf8 predicate in the patch: it
> can create invalid utf-8 if the input is a surrogate code point.

True.
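
For illustration only (Python, not the predicate from the patch): an
encoder that just applies the usual bit layout will happily emit three
bytes for a surrogate, and a conforming decoder then rejects them.  Both
function names are made up, and the range check is only one possible fix.

```python
def naive_utf8(cp: int) -> bytes:
    """Encode a code point by bit layout alone, with no surrogate check."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def checked_utf8(cp: int) -> bytes:
    """Same encoder, but refuse surrogates and out-of-range values."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not encodable in UTF-8: %#x" % cp)
    return naive_utf8(cp)


bad = naive_utf8(0xD800)          # bytes ED A0 80, not well-formed UTF-8
try:
    bad.decode("utf-8")
    decoder_accepted = True
except UnicodeDecodeError:
    decoder_accepted = False
```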

> I have significant doubts about changing `char' to mean `code point'; I
> believe a new type should be introduced.

I think code points correspond most closely to what people consider
"characters" so `char' is the right choice.  Character literals have
type `char' so your change would be even bigger, I think.

I don't see any need for a new type for code units, rather than int.
I only used string.unsafe_index_code_unit in two situations in the
string module:

- computing the hash of a string: code units treated as integers
- optimising string.prefix/suffix: code units are compared for equality
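
Both uses only need code units as plain integers, roughly as in the sketch
below.  It is Python, with bytes standing in for UTF-8 code units; the
hash function is an arbitrary multiplicative one chosen for illustration
and is not the one in the string module.

```python
def hash_code_units(units: bytes) -> int:
    """Hash a string by folding its code units in as integers."""
    h = 0
    for unit in units:
        h = (h * 31 + unit) & 0xFFFFFFFF   # keep the value within 32 bits
    return h


def is_prefix(prefix: bytes, units: bytes) -> bool:
    """Prefix test by comparing code units for equality, one at a time."""
    if len(prefix) > len(units):
        return False
    for i in range(len(prefix)):
        if prefix[i] != units[i]:
            return False
    return True


cafe = "caf\u00e9".encode("utf-8")
has_prefix = is_prefix("caf".encode("utf-8"), cafe)
wrong_prefix = is_prefix("cafe".encode("utf-8"), cafe)
```

Neither function needs to know where code point boundaries fall, which is
why a plain int is enough here.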

I also mentioned another potential use earlier, but again, a separate
code unit type would not help.

> (And I believe the Mercury
> compiler should continue to deal with code units rather than code points;
> more detail given in the message I sent previously).

I didn't receive your message.

Peter