[m-dev.] Unicode support in Mercury

Peter Wang novalazy at gmail.com
Mon May 9 16:13:08 AEST 2011


On 2011-05-09, Matt Giuca <matt.giuca at gmail.com> wrote:
> 
> Note that I haven't tried out the latest build yet; just going by the
> documentation.
> 
> As an example, the string.length function will work differently
> depending on the implementation. length("日本語") will be 9 on UTF-8
> builds, and 3 on UTF-16 builds, despite the fact that it only has 3
> characters. However, count_codepoints("日本語") will always be 3. It
> seems to me that length should be a synonym for count_codepoints, not
> count_code_units, as I can't see any reason to use count_code_units in
> day-to-day usage.

In my experience, `length' is mainly used for two things: checking that
an index is within range, and getting the end point when iterating over
a string.  In both cases it doesn't matter what `length' returns, but it
must be compatible with the indexing predicates.
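
For instance, a traversal never examines the offsets themselves: it
just feeds back whatever `index_next' returns, and `index_next' fails
once the offset reaches length(Str).  A sketch, re-implementing
`count_codepoints' by hand:

    :- module count_demo.
    :- interface.
    :- import_module io.
    :- pred main(io::di, io::uo) is det.
    :- implementation.
    :- import_module int.
    :- import_module string.

    main(!IO) :-
        count_chars("日本語", Count),
        io.write_int(Count, !IO),   % prints 3 in every grade
        io.nl(!IO).

        % Walk the code unit offsets, counting one code point per step.
    :- pred count_chars(string::in, int::out) is det.

    count_chars(Str, Count) :-
        count_loop(Str, 0, 0, Count).

    :- pred count_loop(string::in, int::in, int::in, int::out) is det.

    count_loop(Str, Offset, !Count) :-
        ( if string.index_next(Str, Offset, NextOffset, _Char) then
            !:Count = !.Count + 1,
            count_loop(Str, NextOffset, !Count)
        else
            true    % Offset = length(Str): end of the string.
        ).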

I have little use for knowing how many code points are in a string.
In the compiler, `count_codepoints' is not used.  In the standard
library, apart from implementing `by_codepoint' predicates, it is used
for aligning strings badly (making the incorrect assumption that all
code points take the same width when printed).

> The current implementation also breaks the invariant
> that string.length(S) = list.length(string.to_char_list(S)), whereas
> this would fix that.

You get two invariants instead ;-)
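
That is, with the current definitions:

    string.length(S) = string.count_code_units(S)
    string.count_codepoints(S) = list.length(string.to_char_list(S))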

> Even more broken is string.index, which deals with code units but
> returns a code point. This implies that index_det("日本語", 0) = '日',

Yes.

> index_det("日本語", 1) = '日', index_det("日本語", 2) = '日',

These are invalid, not so different to index_det("日本語", -1) or
index_det("日本語", 100).

> index_det("日本語",
> 3) = '本', and so on.

Yes.

> I can't even find an index_by_codepoint function,
> which I would expect would work such that
> index_by_codepoint_det("日本語", 0) = '日', index_by_codepoint_det("日本語",
> 1) = '本', and so on, but again, I would prefer for this behaviour to
> be the default for the index function.

You have to use string.codepoint_offset.  Consider it a tax on
inefficient code.
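
For the record, it is a two-line wrapper (a sketch, assuming the
semidet codepoint_offset/3 and index/3 signatures, with string
imported):

    :- pred index_by_codepoint(string::in, int::in, char::out)
        is semidet.

    index_by_codepoint(Str, N, Char) :-
        string.codepoint_offset(Str, N, Offset),
        string.index(Str, Offset, Char).

A det version need only wrap this and throw on failure.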

> I see a number of "by_codepoint" versions of functions, but I would
> argue that these should all be the defaults, and "by_code_unit"
> variants could be added if desired.
> 
> Fundamentally, a Unicode string should be conceptually a list of
> characters, where each character is a single code point. The
> underlying encoding is irrelevant to the user unless they specifically
> ask to get the text out as a byte sequence. Several of the string
> functions have changed to support this view: to_char_list and foldl,
> for example, but the majority are supporting a low-level
> implementation-dependent view of strings.
> 
> Having said this, I imagine the reason for it is performance --
> dealing with codepoints rather than code units requires linear time,
> so that probably justifies it. Still, I feel that the code_unit
> variants are basically meaningless, so they shouldn't be the defaults.

Yes, performance is what it comes down to.  But IMHO working with code
unit offsets is not much harder than working with code point offsets.

I find it helps to think more abstractly about the offset arguments.
The set of valid offsets for a string is not all of [0, length(Str)-1],
but some subset.  0 is still the start of the string.  length(Str)
is still the end.  The next offset after N is _not_ N + 1, but
`index_next' gives you the value anyway.  The actual value is
immaterial.

You can still do useful arithmetic with the offsets, like subtracting
offsets to get the length of the substring.  If you know you are dealing
with an ASCII code point (extremely common), you can still add 1.
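
For example (a sketch; I am assuming `sub_string_search' and `split'
both take code unit offsets, with int and string imported):

    :- pred split_at_colon(string::in, string::out, string::out)
        is semidet.

    split_at_colon(Str, Key, Value) :-
        string.sub_string_search(Str, ":", Offset),
        % Offset - 0 is the length of the key, in code units.
        string.split(Str, Offset, Key, _),
        % ':' is ASCII, so Offset + 1 is also a valid offset.
        string.split(Str, Offset + 1, _, Value).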

> One last issue is the use of the term "codepoint". The new functions
> seem to be differentiating between "codepoint" and "code unit", but I
> would suggest that "character" is a better name than "codepoint". In
> Unicode, "codepoint" refers to the integer value assigned to a
> character. Thinking abstractly, a string is a sequence of characters,
> not a sequence of codepoints. From the Unicode Standard 6.0, section
> 2.4:
> 
> > The range of integers used to code the abstract characters is called the codespace. A particular
> > integer in this set is called a code point. When an abstract character is mapped or
> > assigned to a particular code point in the codespace, it is then referred to as an encoded
> > character.
> 
> This would fit well with Mercury's existing terminology for
> "character". As of the recent patch, Mercury's 'char' data type now
> corresponds to the Unicode definition of a character.

Perhaps.  On the other hand `char' can represent code points to which no
abstract characters have been assigned.
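
For example, U+0378 has no abstract character assigned to it (as of
Unicode 6.0), yet it is a perfectly good `char':

    % Assuming char.det_from_int (otherwise, the reverse mode of
    % char.to_int does the same job):
    Unassigned = char.det_from_int(0x0378)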

Peter