[m-dev.] Unicode support in Mercury

Matt Giuca matt.giuca at gmail.com
Tue May 10 22:19:03 AEST 2011


> Dealing with surrogate pairs just means using string.index_next instead
> of Index + 1, doesn't it?  Any other realistic code I can think of would
> be indifferent to the units of measurement, or would be just as likely
> to want to work with collation units as code points.

There are lots of other subtle differences in the code. For one thing,
if you are using length for any semantic purpose, besides just
checking whether an index will fail, the difference between code units
and code points will be meaningful. Here's a realistic example: say
you are writing a Twitter client, and you have to limit strings to 140
characters (yes, Twitter counts characters, not bytes -- I checked --
https://twitter.com/mgiuca/status/67922010088546304). With the current
scheme, Chinese users will be wondering why you are only letting them
type 46 characters (a Chinese character typically takes three UTF-8
code units, and 140 / 3 rounds down to 46), unless you specifically
thought about it and called string.count_codepoints. My point is, most
people won't, and they'll just use string.length, which will appear to
do the right thing until their Chinese friend points out that it
doesn't.
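
To make the arithmetic concrete, here is a minimal sketch in Python
rather than Mercury (the helper names are made up for illustration,
and Python's len stands in for a code-point count while the encoded
byte length stands in for a UTF-8 code-unit count):

```python
# Sketch only: contrasts a code-point limit with a UTF-8 code-unit limit.
# The function names are hypothetical, not part of any Mercury or Twitter API.

LIMIT = 140


def code_points(text: str) -> int:
    """Number of Unicode code points (what Python 3's len() reports)."""
    return len(text)


def utf8_code_units(text: str) -> int:
    """Number of UTF-8 code units, i.e. bytes in the encoded form."""
    return len(text.encode("utf-8"))


def fits_by_code_points(text: str, limit: int = LIMIT) -> bool:
    return code_points(text) <= limit


def fits_by_code_units(text: str, limit: int = LIMIT) -> bool:
    return utf8_code_units(text) <= limit


if __name__ == "__main__":
    ascii_msg = "a" * 140
    chinese_msg = "\u4e2d" * 47  # 47 copies of a three-byte character

    # For ASCII the two measures agree, which is why the bug hides in testing.
    print(fits_by_code_points(ascii_msg), fits_by_code_units(ascii_msg))
    # For Chinese text the code-unit check rejects a message well under 140
    # characters.
    print(fits_by_code_points(chinese_msg), fits_by_code_units(chinese_msg))
```

A check written against code units passes every ASCII test and only
starts rejecting valid messages once multi-byte text shows up.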

Another example which happened to me. On IVLE, we had to buffer
characters and send them over the network when enough had been
buffered. It didn't particularly matter how big the buffer was;
anywhere from half a kilobyte to one kilobyte was fine. But we did
need to split up the string into small chunks to avoid blocking. I
don't know why, but we
weren't dealing with Python's Unicode type, but plain byte strings. We
cut the strings into 512-byte chunks, sent them over the network, and
stitched them back together again on the other side (JavaScript).
Things were going fine until one day we ran some non-ASCII tests and
found that occasional characters were getting corrupted, because
their UTF-8 bytes were being split across buffers, and each buffer was
being converted to Unicode characters by JavaScript before being
stitched back up. This is an example of a programmer's assumption that
it was safe to just take indices at an arbitrary size (512 bytes),
when in fact it wasn't. Had we been using true Unicode strings, we
could have just buffered in 512-character chunks and had no problems.
In the
current Mercury implementation, this second example would manifest
itself as random runtime failures/aborts rather than corrupted
characters, which is not much better. Peter Schachte's suggestion (to
abstract the index type) would fix this by not letting you write such
code in the first place, but then how could you go about chunking a
string into large increments at all?
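
For illustration, here is a rough Python sketch of the failure and of
one byte-level workaround. It is not the IVLE code, and the function
names are made up. The workaround assumes the input is valid UTF-8 and
relies on the fact that continuation bytes always have the bit pattern
10xxxxxx, so a cut point can be moved back until it no longer lands on
one:

```python
# Sketch only: naive fixed-size byte chunking versus chunking that backs up
# to a UTF-8 character boundary. Function names are hypothetical.


def naive_chunks(data: bytes, size: int) -> list:
    """Cut every `size` bytes, regardless of character boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def boundary_chunks(data: bytes, size: int) -> list:
    """Cut into chunks of at most `size` bytes without splitting a character.

    Assumes `data` is valid UTF-8. `size` must be at least 4, the longest
    possible UTF-8 sequence, so that every chunk can hold one character.
    """
    if size < 4:
        raise ValueError("size must be at least 4 bytes")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + size, len(data))
        # If the byte just past the cut is a continuation byte (10xxxxxx),
        # the cut falls inside a character, so move it back.
        while start < end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            # Only reachable on invalid input; fall back to a plain cut
            # so the loop always makes progress.
            end = min(start + size, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    payload = ("a" + "\u4e2d" * 400).encode("utf-8")
    for chunk in boundary_chunks(payload, 512):
        chunk.decode("utf-8")  # each chunk decodes on its own
    try:
        for chunk in naive_chunks(payload, 512):
            chunk.decode("utf-8")
    except UnicodeDecodeError as err:
        print("naive chunking split a character:", err.reason)
```

This keeps the chunks roughly the requested size, but it only works
because the code knows about the encoding, which is exactly the
special-casing a string type indexed by code points would make
unnecessary.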

You could say that it's the programmer's responsibility to test, but
programmers are notorious for only testing on ASCII. The point of a
good Unicode implementation is that programmers don't need to consider
non-ASCII characters to be special cases, as they are not treated
specially.



