[m-dev.] Unicode support in Mercury

Matt Giuca matt.giuca at gmail.com
Wed May 11 13:38:03 AEST 2011


> They should be counting collation units, not code points.  That's my
> point - code point counting is not that useful.
>
> For example, a message with fewer than 140 code points may have greater
> than 140 code points when normalized, because some code points (such as
> "o" with an umlaut) may be separated into a code point for the base letter
> followed by a code point for the accent.  It's the number of collation
> units which is independent of the representation.

Sorry, I misunderstood your earlier email. I assumed that by "collation
unit" you meant "code unit"; I hadn't previously heard the term
"collation unit" used to mean "a character together with all of its
combining characters, as a single unit," which is what I assume you
mean now.

So let's be perfectly clear about the representation of the following
string: "café".
Note that I could have represented this string as "café", using U+00E9
to represent the "é" (LATIN SMALL LETTER E WITH ACUTE), but I am
deliberately using the decomposed form, "e" (U+0065 LATIN SMALL LETTER
E) + "◌́" (U+0301 COMBINING ACUTE ACCENT).

This string has four collation units, in Mark's terminology: "c",
"a", "f", "é".
This string has five code points, U+0063 U+0061 U+0066 U+0065 U+0301.
The fourth and fifth code points comprise the collation unit "é".
With a UTF-8 encoding, this string has six code units (six bytes): 63
61 66 65 CC 81. The fifth and sixth code units comprise the code point
U+0301.
With a UTF-16 encoding, this string has five code units (ten bytes):
0063 0061 0066 0065 0301.
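
To make those counts concrete, here is a quick sketch in Python
(rather than Mercury, purely for illustration); counting non-combining
code points is only a rough stand-in for proper collation/grapheme
segmentation:

    import unicodedata

    s = "cafe\u0301"   # decomposed "café": U+0063 U+0061 U+0066 U+0065 U+0301

    code_points = len(s)                           # 5
    utf8_units  = len(s.encode("utf-8"))           # 6 (63 61 66 65 CC 81)
    utf16_units = len(s.encode("utf-16-be")) // 2  # 5 (ten bytes)
    # Rough approximation of collation units: code points that do not
    # combine with a preceding base character.
    collation_units = sum(1 for c in s if unicodedata.combining(c) == 0)  # 4

    print(collation_units, code_points, utf8_units, utf16_units)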

So, having covered that, I agree with Mark's assessment that collation
units are more useful (semantically) than code points: normalising a
string's combining characters does not change the number of collation
units, but it can change the number of code points.
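
For example (the same Python sketch idea, with the same caveat about
the rough collation-unit count):

    import unicodedata

    decomposed = "cafe\u0301"                              # e + combining acute
    composed   = unicodedata.normalize("NFC", decomposed)  # precomposed é

    print(len(decomposed), len(composed))   # 5 vs 4 code points
    # Both forms still have 4 collation units; only the code point
    # (and code unit) counts change under normalisation.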

But I feel that this is taking the argument off topic somewhat,
because I was originally making a distinction between code units and
code points, not between code points and collation units. You said
"code point counting is not that useful" but I would argue it is far
more useful than code unit counting, which was the problem in the
first place.

Furthermore, Unicode and all applications I know of consider a
"character" to be a code point, not a collation unit. While code
points are not *representation* independent (they are subject to
normalisation), they are *implementation* independent. It is usually
acceptable to expose users to non-normalised representations: since
they type each code point in with a keyboard, they are already exposed
to separate combining characters -- but it is not normally acceptable
to expose users to implementation details such as code units.
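
A trivial illustration (Python again; nothing Mercury-specific is
assumed here): the code unit sequences differ between encodings, but
decoding either one recovers exactly the same code points.

    s = "cafe\u0301"

    utf8  = s.encode("utf-8")      # 6 code units
    utf16 = s.encode("utf-16-be")  # 5 code units (10 bytes)

    assert list(utf8.decode("utf-8")) == list(utf16.decode("utf-16-be")) == list(s)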

Returning to the Twitter example, this page on Twitter claims that
Twitter normalises collation units:
http://dev.twitter.com/pages/counting_characters
But in my experience, it does not. Twitter imposes a 140 *code point*
limit. Not a 140 *byte* limit or a 140 *code unit* limit or a 140
*collation unit* limit. It is fixated on the code point, which is the
same level at which users interact with strings when typing them on a
keyboard. In other words, if I type "café" by hitting the keys C, A,
F, E and then Ctrl+Shift+U301, I have entered five "characters", and
this will cost me five of my 140 code points.

>
> A function from index -> int, which just unwraps the value.
>
> Functions from (string, int) -> index that either fail/abort if
> the integer doesn't point to the start of a code point in the string,
> or return the next/previous valid index.

One problem with this second function is that you would not be able
to use an index on a string without checking again that it points to
the start of a code point. The reason is that you could convert the
int into an index with respect to one string, and then use that index
to access a different string. Therefore, both the (string, int) ->
index function and the string indexing function would need a runtime
check and a fail/abort condition.
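
As a rough sketch of the kind of check involved (Python, assuming a
UTF-8 representation; is_code_point_start is a hypothetical helper,
not the proposed Mercury API):

    def is_code_point_start(b: bytes, i: int) -> bool:
        # In UTF-8, continuation bytes have the bit pattern 10xxxxxx.
        return 0 <= i < len(b) and (b[i] & 0xC0) != 0x80

    s1 = "caf\u00e9".encode("utf-8")  # precomposed "café": é occupies bytes 3-4
    s2 = "cafes".encode("utf-8")      # plain ASCII

    print(is_code_point_start(s2, 4))  # True:  offset 4 starts 's' in s2
    print(is_code_point_start(s1, 4))  # False: offset 4 is the second byte of é in s1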

> As I understand it, unicode is specifically designed to support these
> operations efficiently.

Unicode isn't designed to support any operations efficiently. Various
encodings (UTF-8, UTF-16, etc.) are designed to support certain
operations efficiently, each with its own strengths and weaknesses
(for example, UTF-32 is designed for efficient access by code point
index; UTF-8 is designed for compact representation and ASCII
compatibility). Unicode itself is just designed to provide an abstract
interface over a sequence of characters.
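
For instance (a Python sketch, just to make the size trade-off
concrete):

    for text in ("hello", "caf\u00e9"):
        for enc in ("utf-8", "utf-16-be", "utf-32-be"):
            print(text, enc, len(text.encode(enc)), "bytes")
    # "hello": 5 / 10 / 20 bytes;  "café" (precomposed): 5 / 8 / 16 bytes.
    # UTF-32 spends 4 bytes per code point but gives O(1) indexing by
    # code point; UTF-8 is the most compact for mostly-ASCII text.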



