[m-dev.] Unicode support in Mercury

Matt Giuca matt.giuca at gmail.com
Mon May 9 13:00:02 AEST 2011

Previous message: [m-dev.] further proposed library changes
Next message: [m-dev.] Unicode support in Mercury
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

I just caught up on the recent development history in Mercury and
found out about Unicode support. I gather that this patch has been
accepted but wasn't released with 11.01, so there is still time to
comment on it.

I think this is a positive development, but I just wanted to comment
on the distinction the "string" module makes between "code units" and
"code points". It seems like all of the "old" string functions deal
with code units, whereas the new functions deal with code points. This
seems to be at odds with the concept of Unicode, whose whole point is
that code units are an irrelevant implementation detail (they depend
on which encoding the implementation uses). A
language that supports true Unicode strings will abstract away the
details of the code units and let the programmer deal solely with code
points.

Note that I haven't tried out the latest build yet; I am just going by
the documentation.

As an example, the string.length function will work differently
depending on the implementation. length("日本語") will be 9 on UTF-8
builds, and 3 on UTF-16 builds, despite the fact that it only has 3
characters. However, count_codepoints("日本語") will always be 3. It
seems to me that length should be a synonym for count_codepoints, not
count_code_units, as I can't see any reason to use count_code_units in
day-to-day usage. The current implementation also breaks the invariant
that string.length(S) = list.length(string.to_char_list(S)); making
length count code points would restore it.
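
The counts above can be checked with a small sketch. It is written in
Python rather than Mercury, because the Mercury behaviour described
here comes from the documentation and not from a tested build. The
helper names count_code_units and count_code_points are hypothetical;
they only mirror the names used in this message and are not the
Mercury library functions.

```python
# Size in bytes of one code unit for the encodings compared here.
CODE_UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2}


def count_code_units(s: str, encoding: str) -> int:
    """Number of code units needed to store s in the given encoding."""
    return len(s.encode(encoding)) // CODE_UNIT_SIZE[encoding]


def count_code_points(s: str) -> int:
    """Number of code points in s (a Python 3 str is indexed by code point)."""
    return len(s)


text = "日本語"
utf8_units = count_code_units(text, "utf-8")
utf16_units = count_code_units(text, "utf-16-le")
code_points = count_code_points(text)

# The invariant from the message holds when length counts code points.
invariant_holds = code_points == len(list(text))
```

Only the code point count is independent of the encoding, which is
the property one would want from a default length function.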

Even more broken is string.index, which deals with code units but
returns a code point. This implies that index_det("日本語", 0) = '日',
index_det("日本語", 1) = '日', index_det("日本語", 2) = '日', index_det("日本語",
3) = '本', and so on. I can't even find an index_by_codepoint function,
which I would expect would work such that
index_by_codepoint_det("日本語", 0) = '日', index_by_codepoint_det("日本語",
1) = '本', and so on, but again, I would prefer for this behaviour to
be the default for the index function.
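
The two indexing schemes can be sketched as follows, again in Python
over a UTF-8 byte string. This is only a model of the behaviour
inferred above from the documentation (an offset inside a multi-byte
sequence yielding the character that contains it); it is an
assumption, not verified Mercury behaviour. The names
char_at_code_unit and char_at_code_point are hypothetical.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_at_code_unit(data: bytes, offset: int) -> str:
    """Character whose UTF-8 encoding contains the code unit at offset."""
    if not 0 <= offset < len(data):
        raise IndexError("code unit offset out of range")
    # Back up to the lead byte of the sequence containing offset.
    start = offset
    while start > 0 and is_continuation(data[start]):
        start -= 1
    # Extend forward over the continuation bytes of that sequence.
    end = start + 1
    while end < len(data) and is_continuation(data[end]):
        end += 1
    return data[start:end].decode("utf-8")


def char_at_code_point(data: bytes, index: int) -> str:
    """Character at the given code point index, found by a linear scan."""
    seen = -1
    for offset, byte in enumerate(data):
        if not is_continuation(byte):
            seen += 1
            if seen == index:
                return char_at_code_unit(data, offset)
    raise IndexError("code point index out of range")


encoded = "日本語".encode("utf-8")
by_unit = [char_at_code_unit(encoded, i) for i in range(4)]
by_point = [char_at_code_point(encoded, i) for i in range(3)]
```

Under this model, by_unit repeats the first character for offsets 0
to 2 before reaching the second, while by_point yields each character
once. Note that char_at_code_point has to scan from the start of the
string, so it takes linear time in a variable-width encoding.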

I see a number of "by_codepoint" versions of functions, but I would
argue that these should all be the defaults, and "by_code_unit"
variants could be added if desired.

Fundamentally, a Unicode string should be conceptually a list of
characters, where each character is a single code point. The
underlying encoding is irrelevant to the user unless they specifically
ask to get the text out as a byte sequence. Several of the string
functions have changed to support this view (to_char_list and foldl,
for example), but the majority still support a low-level,
implementation-dependent view of strings.

Having said this, I imagine the reason for it is performance --
indexing by codepoint rather than by code unit requires linear time in
a variable-width encoding, so that probably justifies it. Still, I
feel that the code_unit variants are basically meaningless, so they
shouldn't be the defaults.

One last issue is the use of the term "codepoint". The new functions
seem to be differentiating between "codepoint" and "code unit", but I
would suggest that "character" is a better name than "codepoint". In
Unicode, "codepoint" refers to the integer value assigned to a
character. Thinking abstractly, a string is a sequence of characters,
not a sequence of codepoints. From the Unicode Standard 6.0, section
2.4:

> The range of integers used to code the abstract characters is called the codespace. A particular
> integer in this set is called a code point. When an abstract character is mapped or
> assigned to a particular code point in the codespace, it is then referred to as an encoded
> character.

This would fit well with Mercury's existing terminology for
"character". As of the recent patch, Mercury's 'char' data type now
corresponds to the Unicode definition of a character. I think the
'string' module should match this terminology. For example
'count_codepoints' should be 'count_characters' and
'split_by_codepoint' should be 'split_by_character'. (Though this will
tend to be a non-issue if these functions are made the default as I
suggest above.)

Matt Giuca
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to: mercury-developers at csse.unimelb.edu.au
Administrative Queries: owner-mercury-developers at csse.unimelb.edu.au
Subscriptions: mercury-developers-request at csse.unimelb.edu.au
--------------------------------------------------------------------------