[m-dev.] Unicode support in Mercury

Julien Fischer juliensf at csse.unimelb.edu.au
Mon May 9 19:31:00 AEST 2011


Hi,

On Mon, 9 May 2011, Matt Giuca wrote:

> I think this is a positive development, but I just wanted to comment
> on the distinction the "string" module makes between "code units" and
> "code points". It seems like all of the "old" string functions deal
> with code units, whereas the new functions deal with code points.

I suspect there's quite a bit of code that (rightly or wrongly) relies
on the old semantics.  I wouldn't change them without giving users
a suitable number of releases in which to adapt their code.

> This seems to be at odds with the whole concept of Unicode, as the
> whole point of Unicode is that code units are an irrelevant
> implementation detail (depending on which encoding the implementation
> uses). A language that supports true Unicode strings will abstract
> away the details of the code units and let the programmer deal solely
> with code points.
>
> Note that I haven't tried out the latest build yet; just going by the
> documentation.
>
> As an example, the string.length function will work differently
> depending on the implementation. length("日本語") will be 9 on UTF-8
> builds, and 3 on UTF-16 builds, despite the fact that it only has 3
> characters. However, count_codepoints("日本語") will always be 3. It
> seems to me that length should be a synonym for count_codepoints, not
> count_code_units, as I can't see any reason to use count_code_units in
> day-to-day usage. The current implementation also breaks the invariant
> that string.length(S) = list.length(string.to_char_list(S)), whereas
> this would fix that.

Changing the definition of string.length/1 would likely break existing C
foreign_procs that rely on the result of a string.length/1 call.  (After
count_by_code_unit has been included in one or more stable releases, we
can consider changing the semantics of string.length/1.)
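To make the code-unit / code-point distinction concrete, here's a small sketch in Python (used here only because it's easy to run; its str type counts code points, so encoding explicitly stands in for the UTF-8 and UTF-16 Mercury grades being discussed):

```python
# "日本語" is 3 code points; its code-unit count depends on the encoding.
s = "日本語"

utf8_units = len(s.encode("utf-8"))            # each of these chars is 3 bytes in UTF-8
utf16_units = len(s.encode("utf-16-le")) // 2  # one 16-bit unit per BMP character
codepoints = len(s)                            # Python strings count code points

print(utf8_units, utf16_units, codepoints)     # 9 3 3
```

This is exactly the discrepancy described above: string.length/1 would answer 9 or 3 depending on the grade, while count_codepoints/1 always answers 3.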

> Even more broken is string.index, which deals with code units but
> returns a code point. This implies that index_det("日本語", 0) = '日',
> index_det("日本語", 1) = '日', index_det("日本語", 2) = '日', index_det("日本語",
> 3) = '本', and so on. I can't even find an index_by_codepoint function,
> which I would expect would work such that
> index_by_codepoint_det("日本語", 0) = '日', index_by_codepoint_det("日本語",
> 1) = '本', and so on, but again, I would prefer for this behaviour to
> be the default for the index function.
>
> I see a number of "by_codepoint" versions of functions, but I would
> argue that these should all be the defaults, and "by_code_unit"
> variants could be added if desired.

I think adding the by_code_unit variants would be useful.
(I'm wondering why it's "code_unit", but "codepoint"?)
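The index_det behaviour Matt describes (several adjacent code-unit offsets all yielding the same code point) can be modelled outside Mercury. A Python sketch of that semantics on UTF-8 code units (the function name is hypothetical, not part of any library):

```python
def codepoint_at_byte_offset(s: str, offset: int) -> str:
    """Return the code point whose UTF-8 encoding covers byte `offset`.

    Models the described index_det semantics: offsets 0, 1 and 2 all
    land inside the first character of "日本語" and so all yield '日'.
    """
    data = s.encode("utf-8")
    if not 0 <= offset < len(data):
        raise IndexError(offset)
    # Back up over UTF-8 continuation bytes (10xxxxxx) to the lead byte.
    start = offset
    while data[start] & 0xC0 == 0x80:
        start -= 1
    # Advance to the end of this character's encoding.
    end = start + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[start:end].decode("utf-8")

assert codepoint_at_byte_offset("日本語", 0) == "日"
assert codepoint_at_byte_offset("日本語", 2) == "日"
assert codepoint_at_byte_offset("日本語", 3) == "本"
```

A by-codepoint index, by contrast, would simply be s[i] here: offset 1 would yield '本' rather than '日'.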

> Fundamentally, a Unicode string should be conceptually a list of
> characters, where each character is a single code point. The
> underlying encoding is irrelevant to the user unless they specifically
> ask to get the text out as a byte sequence. Several of the string
> functions have changed to support this view: to_char_list and foldl,
> for example, but the majority are supporting a low-level
> implementation-dependent view of strings.

I don't think there's anything wrong with being able to manipulate the
representation of strings in Mercury.  (When interfacing with languages
that don't natively support Unicode, e.g. C, it may be necessary.)
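As a sketch of why representation-level access matters at a C boundary (again in Python for runnability; the buffer handling stands in for what a Mercury foreign_proc passing a string to C would see):

```python
s = "日本語"
raw = s.encode("utf-8")   # the byte sequence a C char* API would receive
# A byte-counting C function (strlen-style) sees 9 code units, not 3
# code points, so code-unit operations are what matter at this boundary.
assert len(raw) == 9
# NUL-terminated copy, as C foreign code would typically expect.
c_buffer = raw + b"\x00"
assert len(c_buffer) == 10
```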

Julien.