[m-dev.] Unicode support in Mercury

Mark Brown mark at csse.unimelb.edu.au
Tue May 10 19:41:22 AEST 2011


On 10-May-2011, Paul Bone <pbone at csse.unimelb.edu.au> wrote:
> On Tue, May 10, 2011 at 10:16:28AM +1000, Peter Schachte wrote:
> > On 09/05/11 22:11, Matt Giuca wrote:
> > > Here he proposes three alternatives: #1 which is to leave the
> > > interface as-is and let users deal with surrogate pairs. This is the
> > > current position of Mercury (except in UTF-8 backends, Mercury lets
> > > users deal with individual bytes rather than surrogate pairs).

Dealing with surrogate pairs just means using string.index_next instead
of Index + 1, doesn't it?  Any other realistic code I can think of would
be indifferent to the units of measurement, or would be just as likely
to want to work with collation units as code points.
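
To make the difference concrete, here is a small Python model (not Mercury
code) of stepping through UTF-16 code units. The function name index_next
mirrors the role of string.index_next, but its signature, return convention
and the helper names are assumptions made for this sketch only.

```python
# Hypothetical Python model of a UTF-16 string; not Mercury's implementation.

def to_utf16_units(text):
    """Encode text as a list of 16-bit code units (UTF-16, no BOM)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def index_next(units, index):
    """Return (next_index, code_point), or None if index is out of range.

    Advances by one whole code point, which occupies one or two
    code units in UTF-16.
    """
    if index < 0 or index >= len(units):
        return None
    unit = units[index]
    # A high surrogate followed by a low surrogate encodes one code point.
    if 0xD800 <= unit <= 0xDBFF and index + 1 < len(units):
        low = units[index + 1]
        if 0xDC00 <= low <= 0xDFFF:
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            return index + 2, code_point
    # Ordinary code unit (or an unpaired surrogate, passed through as-is).
    return index + 1, unit


def code_points(units):
    """Collect every code point by repeatedly calling index_next."""
    result = []
    step = index_next(units, 0)
    while step is not None:
        index, code_point = step
        result.append(code_point)
        step = index_next(units, index)
    return result
```

Walking the same list with Index + 1 instead would visit each half of a
surrogate pair separately, which is the only place the two approaches differ.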

> > > #2, as
> > > I have been suggesting, which is to keep the UTF-16 backend but change
> > > the interface to deal in characters/codepoints -- as Guido says this
> > > is bad due to performance, and I'm backing down on this proposal now.
> > > #3 which is to change the implementation to UTF-32 and have a
> > > character/codepoint interface, which Guido calls "the ideal
> > > situation".

I don't regard this as ideal.  It seems a lot of trouble to go to
just to let people write Index + 1.

Ideally we would have added string.index_next a long time ago, and just
made it add one, so that conscientious programmers could have written
future-proof code all along.  Oh well.

> > 
> > What about a fourth option:  make the string index type an abstract type
> > (stopping anyone from using +1 on it), and implement next_index to take a
> > string and an index, and return a new index, etc.?  Keep the efficient
> > implementation, but hide the index type.

Not a bad idea.  It would force programmers to update any code that currently
increments the index--such code may well be broken already.
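
As a rough illustration of the fourth option, the Python sketch below hides
the offset inside an opaque index object that supports no arithmetic. The
names StringIndex, first_index and next_index are hypothetical, and Python
only hides the field by convention, so this models the idea rather than any
actual Mercury interface.

```python
class StringIndex:
    """Opaque position in a list of UTF-16 code units (hypothetical type).

    It defines no arithmetic, so `index + 1` raises TypeError.
    """
    __slots__ = ("_offset",)

    def __init__(self, offset):
        self._offset = offset

    def __eq__(self, other):
        return isinstance(other, StringIndex) and self._offset == other._offset

    def __hash__(self):
        return hash(self._offset)


def first_index():
    """Index of the first code point."""
    return StringIndex(0)


def next_index(units, index):
    """Return the index of the following code point, or None at the end.

    Each call inspects at most two code units, so a step is constant time.
    """
    offset = index._offset
    if offset >= len(units):
        return None
    width = 1
    # A valid surrogate pair occupies two code units.
    if (0xD800 <= units[offset] <= 0xDBFF and offset + 1 < len(units)
            and 0xDC00 <= units[offset + 1] <= 0xDFFF):
        width = 2
    return StringIndex(offset + width)


def count_code_points(units):
    """Count code points using only first_index and next_index."""
    count = 0
    following = next_index(units, first_index())
    while following is not None:
        count += 1
        following = next_index(units, following)
    return count
```

Client code that only ever calls first_index and next_index never sees the
underlying representation, so the representation stays free to change.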

> 
> I like this suggestion for the ease of programming with it and for keeping each
> step of an iteration over a string constant time.

The existing design of index_next has both of these properties.  ;-)

> 
> The tricky part is how to migrate users to this option, that's a hard problem
> and not the sort of problem that I'm much good at solving.
> 

If there's any possibility of making this change in future, it would be
wise to add

    :- type char_index == int.

to the string module now, so that the aforementioned conscientious programmers
can start using it at their leisure and not have their code break when the
change occurs.

Cheers,
Mark.
