[m-dev.] Unicode support in Mercury

Mark Brown mark at csse.unimelb.edu.au
Mon May 9 17:12:15 AEST 2011

Hi Matt,

On 09-May-2011, Matt Giuca <matt.giuca at gmail.com> wrote:
> Fundamentally, a Unicode string should be conceptually a list of
> characters, where each character is a single code point. The
> underlying encoding is irrelevant to the user unless they specifically
> ask to get the text out as a byte sequence.

They may simply wish to know where the bytes start in memory - that's
relevant to a lexer even though it doesn't care about the byte sequence,
since it wants to peek at or read the next character quickly.

> Several of the string
> functions have changed to support this view: to_char_list and foldl,
> for example, but the majority are supporting a low-level
> implementation-dependent view of strings.
> Having said this, I imagine the reason for it is performance --

Exactly.  :-)

> dealing with codepoints rather than code units requires linear time,
> so that probably justifies it. Still, I feel that the code_unit
> variants are basically meaningless, so they shouldn't be the defaults.

Equality and comparison of code-unit indexes are still meaningful, even
if arithmetic isn't.
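
To illustrate that point, here is a rough sketch in Python rather than Mercury. It assumes a UTF-8 encoding and treats byte offsets as the code-unit indexes; the name `start_offsets` and the sample string are made up for the example.

```python
def start_offsets(text):
    """Return the UTF-8 byte offset at which each code point of text starts."""
    offsets = []
    pos = 0
    for ch in text:
        offsets.append(pos)
        pos += len(ch.encode("utf-8"))  # 1 to 4 bytes per code point
    return offsets


text = "naïve café"
offsets = start_offsets(text)

# Comparison is meaningful: a later character always has a larger offset.
ordered = all(a < b for a, b in zip(offsets, offsets[1:]))

# Arithmetic is not: 'v' is the 4th character (position 3), but its
# offset is 4 because 'ï' occupies two bytes.
v_position = text.index("v")
v_offset = offsets[v_position]
```

Two offsets can be compared to tell which character comes first, but the difference between them says nothing reliable about how many characters lie in between.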

My experience with this is that the change broke the G12 lexer, which
was always incrementing the index by 1 (an operation that is now
meaningless, as you say) instead of by the number of bytes in the
character.  This results in an illegal character abort on Unicode
input.  While the conceptually nice use of code-point indexes would
have meant the lexer continued to work correctly without
modification, its performance would suck even for ASCII.  That
would be a costlier problem to fix than the abort.
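
The failure mode can be sketched as follows. This is Python, not the G12 lexer or the Mercury string library; it assumes UTF-8, and the function names (`utf8_width`, `chars_by_width`, `chars_by_one`) are hypothetical.

```python
def utf8_width(lead):
    """Number of bytes in the UTF-8 sequence starting with this lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # Continuation bytes (0x80..0xBF) and invalid lead bytes end up here.
    raise ValueError("illegal character at lead byte 0x%02x" % lead)


def chars_by_width(buf):
    """Scan buf, advancing by the byte width of each character."""
    out = []
    i = 0
    while i < len(buf):
        w = utf8_width(buf[i])
        out.append(buf[i:i + w].decode("utf-8"))
        i += w
    return out


def chars_by_one(buf):
    """Scan buf, always advancing by 1 (the old, broken behaviour)."""
    out = []
    i = 0
    while i < len(buf):
        w = utf8_width(buf[i])  # raises once i lands mid-character
        out.append(buf[i:i + w].decode("utf-8"))
        i += 1
    return out
```

On pure ASCII input the two scanners agree, since every character is one byte wide.  On input such as `"aé".encode("utf-8")`, `chars_by_one` steps into the middle of the two-byte `é` and raises, while `chars_by_width` returns both characters.  Each step costs constant time in both versions, which is the property a code-point index would give up.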

So I think the right call has been made, with string.index at least.


