[m-dev.] Unicode support in Mercury
Mark Brown
mark at csse.unimelb.edu.au
Tue May 10 19:41:22 AEST 2011
On 10-May-2011, Paul Bone <pbone at csse.unimelb.edu.au> wrote:
> On Tue, May 10, 2011 at 10:16:28AM +1000, Peter Schachte wrote:
> > On 09/05/11 22:11, Matt Giuca wrote:
> > > Here he proposes three alternatives: #1 which is to leave the
> > > interface as-is and let users deal with surrogate pairs. This is the
> > > current position of Mercury (except in UTF-8 backends, Mercury lets
> > > users deal with individual bytes rather than surrogate pairs).
Dealing with surrogate pairs just means using string.index_next instead
of Index + 1, doesn't it? Any other realistic code I can think of would
be indifferent to the units of measurement, or would be just as likely
to want to work with collation units as code points.
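
For illustration, a minimal sketch of stepping through a string with string.index_next rather than Index + 1. It assumes the signature index_next(string::in, int::in, int::out, char::uo) is semidet from the standard library's string module. The module name and the helpers count_code_points and count_from are hypothetical.

```mercury
:- module count_chars.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module int.
:- import_module string.

    % Count the characters in a string by asking the library for the
    % next index, instead of assuming each character is one unit wide.
    % (Hypothetical helper, not part of the string module.)
:- func count_code_points(string) = int.

count_code_points(String) = count_from(String, 0, 0).

:- func count_from(string, int, int) = int.

count_from(String, Index, Count0) = Count :-
    ( string.index_next(String, Index, NextIndex, _Char) ->
        Count = count_from(String, NextIndex, Count0 + 1)
    ;
        % index_next fails once Index reaches the end of the string.
        Count = Count0
    ).

main(!IO) :-
    % "abc" is pure ASCII, so this prints 3 whatever the string encoding.
    io.write_int(count_code_points("abc"), !IO),
    io.nl(!IO).
```

Code written this way never inspects the size of the step from Index to NextIndex, so it does not care whether the backend counts bytes, UTF-16 code units or code points.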
> > > #2, as
> > > I have been suggesting, which is to keep the UTF-16 backend but change
> > > the interface to deal in characters/codepoints -- as Guido says this
> > > is bad due to performance, and I'm backing down on this proposal now.
> > > #3 which is to change the implementation to UTF-32 and have a
> > > character/codepoint interface, which Guido calls "the ideal
> > > situation".
I don't regard this as ideal. It seems a lot of trouble to go to
just to let people write Index + 1.
Ideally we would have added string.index_next a long time ago, and just
made it add one, so that conscientious programmers could have written
future-proof code all along. Oh well.
> >
> > What about a fourth option: make the string index type an abstract type
> > (stopping anyone from using +1 on it), and implement next_index to take a
> > string and an index, and return a new index, etc.? Keep the efficient
> > implementation, but hide the index type.
Not a bad idea. It would force programmers to update any code that currently
increments the index--such code may well be broken already.
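
A rough sketch of how the fourth option might look, assuming string.index_next as the underlying primitive. The module abstract_index, the type str_index, and the names first_index and next_index are all hypothetical; nothing like this is claimed to exist in the standard library.

```mercury
:- module abstract_index.
:- interface.
:- import_module char.

    % The index type is abstract: clients cannot see that it wraps an
    % int, so they cannot write Index + 1 on it.
:- type str_index.

    % The index of the first character of any string.
:- func first_index = str_index.

    % next_index(String, Index, NextIndex, Char):
    % Char is the character at Index and NextIndex is the index just
    % past it. Fails if Index is at or beyond the end of String.
:- pred next_index(string::in, str_index::in, str_index::out, char::out)
    is semidet.

:- implementation.
:- import_module string.

:- type str_index
    --->    str_index(int).

first_index = str_index(0).

next_index(String, str_index(Index), str_index(NextIndex), Char) :-
    string.index_next(String, Index, NextIndex, Char).
```

Iteration stays constant time per step because the wrapper only forwards to the existing operation. Any client code that used to increment the index by hand would stop compiling, which is the forced update described above.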
>
> I like this suggestion for the ease of programming with it and for keeping an
> iteration over a string to constant time.
The existing design of index_next has both of these properties. ;-)
>
> The tricky part is how to migrate users to this option, that's a hard problem
> and not the sort of problem that I'm much good at solving.
>
If there's any possibility of making this change in future, it would be
wise to add
:- type char_index == int.
to the string module now, so that aforementioned conscientious programmers
can start using it at their leisure and not have their code break when the
change occurs.
Cheers,
Mark.