[m-dev.] Unicode support in Mercury

Matt Giuca matt.giuca at gmail.com
Mon May 9 22:11:28 AEST 2011


> I suspect there's quite a bit of code that (rightly or wrongly) relies
> on the old semantics.  I wouldn't change them without giving users
> a suitable number of releases in which to adapt their code.

What was the old semantics? It was code units, wasn't it? (So it was
UTF-16 units in Java, for example.) So the old semantics was already
inconsistent across backends (though I suppose most programs are only
written for a single backend). Therefore, any code which, at the
moment, works on both the C and Java backends (i.e. treats code units
as abstract integers) will automatically keep working if the
semantics is changed to code points.
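
To make the difference concrete, here is a minimal Java sketch (my
own illustration, not Mercury code) of how a single code point yields
different code unit counts under different encodings:

    import java.nio.charset.Charset;

    public class CodeUnitDemo {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF: a single code point.
            String clef = new String(Character.toChars(0x1D11E));
            // UTF-16 code units (a Java-backend "length"):
            System.out.println(clef.length());                    // 2
            // UTF-8 code units (a UTF-8/C-backend "length"):
            System.out.println(
                clef.getBytes(Charset.forName("UTF-8")).length);  // 4
            // Code points: the same regardless of backend.
            System.out.println(
                clef.codePointCount(0, clef.length()));           // 1
        }
    }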

I was thinking this could actually work in favour of a more semantic
implementation later on. If a hypothetical future implementation used
UTF-32, then code units would be the same as code points, so it would
have the semantics I'm talking about.
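
A quick way to check that claim, as a Java sketch (assuming the JDK's
optional "UTF-32" charset is available; it is not one of the
guaranteed standard charsets):

    import java.nio.charset.Charset;

    public class Utf32Demo {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb";  // 3 code points: 'a', U+1D11E, 'b'
            byte[] utf32 = s.getBytes(Charset.forName("UTF-32"));
            // In UTF-32 every code point is exactly one 4-byte code
            // unit, so code unit count and code point count coincide.
            System.out.println(utf32.length / 4);                  // 3
            System.out.println(s.codePointCount(0, s.length()));   // 3
        }
    }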

This means that if you stick with code unit semantics now, you keep
the option of changing to code point semantics in the future without
technically breaking the existing semantics. So I'd be happy with
that.

> I don't think there's anything wrong with being able to manipulate the
> representation of strings in Mercury.  (When interfacing with languages
> that don't natively support Unicode, e.g. C, it may be necessary.)

I wasn't arguing that you shouldn't be able to. Just that the default
string methods shouldn't deal with manipulating the underlying
representation.

On Mon, May 9, 2011 at 7:50 PM, Ian MacLarty
<maclarty at csse.unimelb.edu.au> wrote:
> I don't think this is true for Java.  The Java length method returns
> the number of code units in the string, not the number of code points
> (for that there is codePointCount).  Mercury's approach seems to me to
> be the same as Java's, except that Java uses UTF16, making it less
> likely for the length to return a different value from codePointCount.
>  See http://download.oracle.com/javase/6/docs/api/java/lang/String.html
> and http://download.oracle.com/javase/6/docs/api/java/lang/Character.html#unicode.

This is a good point. Looking over the Java documentation linked
above, Java is indeed quite tied to code units, as Mercury presently
is (so I shouldn't have counted it as an example in favour of the
code point interface). But I believe the creators of Java were
*trying* to make strings based on characters rather than on the
underlying representation. The reason Java ended up having to deal
with code points is historical. If you look at the String
documentation from Java 2, you will see the descriptions are quite
different:
http://download.oracle.com/javase/1.4.2/docs/api/

No mention of "code units"; they just talk about "characters". No
codePointCount; the description of "length" is "the number of 16-bit
Unicode characters in the string". Unfortunately, the Java developers
were victims of a changing standard: Java 1 was released in January
1996 (http://en.wikipedia.org/wiki/Java_version_history); Unicode 2.0
was released in July 1996 (http://en.wikipedia.org/wiki/Unicode),
introducing surrogate pairs. So when Java 1 was released, Unicode
code points really were 16-bit integers, and Java was doing the right
thing in not forcing its developers to think in terms of the
implementation.
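
To show concretely what surrogate pairs did to that model, here is a
small Java sketch (my illustration, not from the thread): after
Unicode 2.0, a Java char is merely a UTF-16 code unit, and a
supplementary character occupies two of them:

    public class SurrogateDemo {
        public static void main(String[] args) {
            String s = "\uD834\uDD1E";  // U+1D11E as a surrogate pair
            System.out.println(s.length());  // 2 "characters" (code units)
            // charAt(0) is 0xD834, a high surrogate -- not a
            // character at all on its own.
            System.out.println(
                Character.isHighSurrogate(s.charAt(0)));  // true
            System.out.println(s.codePointAt(0));  // 119070 (0x1D11E)
        }
    }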

Python had the exact same problem: its Unicode strings were
originally UCS-2/UTF-16 encoded, so Python shares Java's
almost-correct strings, but users are forced to deal in code units
when working with strings containing characters from the
supplementary planes. However, the Python developers realised this
was a problem, as summarised by project lead Guido van Rossum in this
mail from 2001:
http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html

Here he proposes three alternatives:

#1: Leave the interface as-is and let users deal with surrogate
pairs. This is the current position of Mercury (except that on UTF-8
backends, Mercury makes users deal with individual bytes rather than
surrogate pairs).

#2: Keep the UTF-16 backend but change the interface to deal in
characters/code points, as I have been suggesting. As Guido says,
this is bad for performance (see the sketch below), and I'm backing
down on this proposal now.

#3: Change the implementation to UTF-32 and have a character/code
point interface, which Guido calls "the ideal situation".

Later that year, they implemented a compile-time switch in Python,
letting users choose a UCS-2 or UCS-4 build, which is still a
compile-time choice to this day. However, most distros (such as
Debian and Ubuntu) use the UCS-4 build because the UCS-2 build causes
way too many hassles. So I consider Python to effectively have
Unicode "correct" now.

So I agree with you guys: making the default string operations very
slow is not a good idea. But it would be good to leave the option
open of transitioning to a UTF-32 representation later (though this
would surely mess with the C interface).
