[m-dev.] Unicode support in Mercury

Mark Brown mark at csse.unimelb.edu.au
Wed May 11 12:49:45 AEST 2011


On 10-May-2011, Matt Giuca <matt.giuca at gmail.com> wrote:
> > Dealing with surrogate pairs just means using string.index_next instead
> > of Index + 1, doesn't it?  Any other realistic code I can think of would
> > be indifferent to the units of measurement, or would be just as likely
> > to want to work with collation units as code points.
> There are lots of other subtle differences in the code. For one thing,
> if you are using length for any semantic purpose, besides just
> checking whether an index will fail, the difference between code units
> and code points will be meaningful. Here's a realistic example: say
> you are writing a Twitter client, and you have to limit strings to 140
> characters (yes, Twitter counts characters, not bytes -- I checked --
> https://twitter.com/mgiuca/status/67922010088546304). With the current
> scheme, the Chinese will be wondering why you are only letting them
> type 46 characters, unless you specifically thought about it and
> called string.count_codepoints. My point is, most people won't, and
> they'll just use string.length which will do the right thing until
> their Chinese friend points out that it doesn't.

They should be counting collation units, not code points.  That's my
point - code point counting is not that useful.

For example, a message with fewer than 140 code points may have more
than 140 after normalization, because some code points (such as "o"
with an umlaut) can be decomposed into a code point for the base letter
followed by a code point for the combining accent.  It's the number of
collation units that is independent of the representation.
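To illustrate (in Python, since the point is independent of Mercury):
the same user-perceived character can occupy one or two code points
depending on which normalization form the text happens to be in.

```python
import unicodedata

# "o" with an umlaut as a single precomposed code point (U+00F6).
composed = "\u00f6"

# The same character after NFD normalization: "o" followed by a
# combining diaeresis (U+0308).
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed))    # 1 code point
print(len(decomposed))  # 2 code points, same visible character
```

So a length limit counted in code points can give different answers
for visually identical strings.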

> Another example which happened to me. On IVLE, we had to buffer
> characters and send them over the network when enough had been
> buffered. It didn't particularly matter how big the buffer was;
> between half to one kilobyte was fine. But we did need to split up the
> string into small chunks to avoid blocking. I don't know why, but we
> weren't dealing with Python's Unicode type, but plain byte strings.

There's your type error!  You can't cast any old byte string into a
Unicode string without putting some thought into it.

> We
> cut the strings into 512-byte chunks, sent them over the network, and
> stitched them back together again on the other side (JavaScript).
> Things were going fine until one day we ran some non-ASCII tests and
> found that the occasional characters were getting corrupted, because
> their UTF-8 bytes were being split across buffers, and they were being
> converted to Unicode characters by JavaScript before being stitched
> back up. This is an example of a programmer's assumption that it was
> safe to just take indices at an arbitrary size (512 bytes), when in
> fact it wasn't safe. Had we been using true Unicode strings, we could
> have just buffered in 512-character chunks and had no problems. In the
> current Mercury implementation, this second example would manifest
> itself as random runtime failures/aborts rather than corrupted
> characters, which is not much better. Peter Schachte's suggestion (to
> abstract the index type) would fix this by not letting you write such
> code in the first place, but then how could you go about chunking a
> string into large increments at all?
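The failure mode described above is easy to reproduce in a few lines of
Python (the sample text and chunk size are illustrative, not taken from
the original system): split the UTF-8 bytes at a fixed byte boundary and
decode each chunk independently, as the receiving side effectively did.

```python
text = "héllo wörld"          # any non-ASCII text will do
data = text.encode("utf-8")

# 2-byte chunks guarantee that some multi-byte character gets split
# across a chunk boundary.
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

corrupted = False
try:
    "".join(chunk.decode("utf-8") for chunk in chunks)
except UnicodeDecodeError:
    corrupted = True

print("corrupted:", corrupted)  # True: a character straddled a boundary
```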

A function from index -> int, which just unwraps the value.

Functions from (string, int) -> index that either fail/abort if
the integer doesn't point to the start of a code point in the string,
or return the next/previous valid index.

As I understand it, UTF-8 is specifically designed to support these
operations efficiently: it is self-synchronizing, so the nearest code
point boundary can be found from any byte offset.
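Here is a sketch of those operations in Python over a UTF-8 byte string
(the names and the chunking helper are illustrative, not Mercury's
actual API).  The snapping relies on UTF-8 continuation bytes always
having the form 0b10xxxxxx, so any offset can be walked back to the
start of its code point.

```python
def is_boundary(data: bytes, i: int) -> bool:
    """True if offset i starts a code point (or is the end of the string)."""
    return i == len(data) or (data[i] & 0xC0) != 0x80

def prev_valid_index(data: bytes, i: int) -> int:
    """Snap i backwards to the nearest code point boundary."""
    while not is_boundary(data, i):
        i -= 1
    return i

def chunk(data: bytes, size: int):
    """Yield chunks of at most `size` bytes, never splitting a code point."""
    start = 0
    while start < len(data):
        end = prev_valid_index(data, min(start + size, len(data)))
        if end == start:            # one code point is longer than `size`
            end += 1
            while not is_boundary(data, end):
                end += 1
        yield data[start:end]
        start = end

data = "héllo wörld".encode("utf-8")
parts = [c.decode("utf-8") for c in chunk(data, 2)]
print(parts)                        # every chunk decodes cleanly
```

With an abstract index type, `prev_valid_index` (and its forward twin)
is exactly the (string, int) -> index function above, and unwrapping an
index back to an int is trivial.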


mercury-developers mailing list
Post messages to:       mercury-developers at csse.unimelb.edu.au
Administrative Queries: owner-mercury-developers at csse.unimelb.edu.au
Subscriptions:          mercury-developers-request at csse.unimelb.edu.au
