[m-dev.] Unicode support in Mercury

Mark Brown mark at csse.unimelb.edu.au
Wed May 11 19:06:02 AEST 2011
On 11-May-2011, Matt Giuca <matt.giuca at gmail.com> wrote:
> > They should be counting collation units, not code points.  That's my
> > point - code point counting is not that useful.
> >
> > For example, a message with fewer than 140 code points may have greater
> > than 140 code points when normalized, because some code points (such as
> > "o" with an umlaut) may be separated into a code point for the base letter
> > followed by a code point for the accent.  It's the number of collation
> > units which is independent of the representation.
> 
> Sorry, I misunderstood your earlier email. I assumed by "collation
> unit" you meant "code unit"; I haven't heard the term "collation unit"
> before used to mean "a character with all of its combining characters
> as a single unit," which is what I assume you mean now.
> 

(Actually, the correct term may be "composite character".  Collation unit is
a similar concept.  The terminology is from the possibly-too-subtle-for-me
Unicode spec.)

> So having covered that, I agree with Mark's assessment that collation
> units are more useful (semantically) than code points, because
> normalising a string's combining characters does not change the number
> of collation units, but does change the number of code points.
> 
> But I feel that this is taking the argument off topic somewhat,
> because I was originally making a distinction between code units and
> code points, not between code points and collation units.  You said
> "code point counting is not that useful" but I would argue it is far
> more useful than code unit counting, which was the problem in the
> first place.

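To make the three kinds of count concrete, here is a rough sketch in
Python (not Mercury).  The helper name `counts` is made up for
illustration, and treating "a code point with combining class 0 starts a
new unit" as a composite character is only an approximation of what the
Unicode spec defines.

```python
import unicodedata

def counts(s):
    """Return (UTF-16 code units, code points, approximate composite characters)."""
    utf16_units = len(s.encode("utf-16-le")) // 2
    code_points = len(s)  # a Python 3 str is a sequence of code points
    # Approximation: every code point with combining class 0 starts a new unit.
    composites = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return utf16_units, code_points, composites

composed = "\u00f6"                                    # "o" with umlaut, one code point
decomposed = unicodedata.normalize("NFD", composed)    # "o" + U+0308
clef = "\U0001D11E"                                    # needs a surrogate pair in UTF-16

print(counts(composed))    # (1, 1, 1)
print(counts(decomposed))  # (2, 2, 1)
print(counts(clef))        # (2, 1, 1)
```

Normalizing changes the first two numbers but not the third, and the
surrogate pair changes only the first.
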
FWIW, I'm interpreting the topic as being a choice between two styles
of interface, array-like and stream-like.

    Array-like: you access any individual char by supplying an integer
    index from a contiguous range.

    Stream-like: you can access the first character or the next character,
    but no others.

I agree with you that we want a "truly abstract" interface, but that's
a moot point since either interface can be made that way.

I interpret your argument as: "if the implementation is utf32, then we
can provide an array-like interface without the programmer having to
deal with surrogate pairs".  (My apologies in advance if I've
misrepresented your position.)

My response is essentially that surrogate pairs aren't the only problem
that programmers have to deal with; there are also composite characters
and collation units.  So the array-like interface is likely to be not
quite what they want anyway.

In contrast, the stream-like interface works naturally with different
units.  Whether you scan a code unit, composite character or collation
unit, it's still just a "next index" that you get back.
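
A minimal sketch of what I mean, in Python rather than Mercury and
assuming a valid UTF-8 byte string as the representation.  All of the
function names are hypothetical, and the composite character rule (a
base code point plus any following code points with a non-zero combining
class) is a simplification.

```python
import unicodedata

def next_code_unit(buf, i):
    """Return (code unit, next index) for the byte at index i."""
    return buf[i], i + 1

def next_code_point(buf, i):
    """Return (code point, next index); assumes valid UTF-8 and i on a boundary."""
    lead = buf[i]
    if lead < 0x80:
        n = 1
    elif lead < 0xE0:
        n = 2
    elif lead < 0xF0:
        n = 3
    else:
        n = 4
    return buf[i:i + n].decode("utf-8"), i + n

def next_composite(buf, i):
    """Return (base code point plus following combining marks, next index)."""
    text, j = next_code_point(buf, i)
    while j < len(buf):
        ch, k = next_code_point(buf, j)
        if unicodedata.combining(ch) == 0:
            break
        text += ch
        j = k
    return text, j

def scan(buf, step):
    """Walk the whole buffer using only 'next index' values from step."""
    i, items = 0, []
    while i < len(buf):
        item, i = step(buf, i)
        items.append(item)
    return items

buf = "o\u0308k".encode("utf-8")          # 4 bytes, 3 code points, 2 composites
print(len(scan(buf, next_code_unit)))     # 4
print(len(scan(buf, next_code_point)))    # 3
print(len(scan(buf, next_composite)))     # 2
```

The caller's loop in `scan` is the same in all three cases; only the
step function knows which unit is being scanned.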

> >
> > A function from index -> int, which just unwraps the value.
> >
> > Functions from (string, int) -> index that either fail/abort if
> > the integer doesn't point to the start of a code point in the string,
> > or return the next/previous valid index.
> 
> One problem with this second function is that you would not be able to
> use an index on a string without checking again that it points to the
> start of a codepoint. The reason being that you could convert the int
> into an index with regards to one string, and then use the index to
> access a different string. Therefore, both the (string, int) -> index
> function and the string index function would both need a runtime check
> and a fail/abort condition.

That's a property of the underlying encoding, not the interface.  There's
no reason why the stream interface can't be used with utf32 as the
underlying encoding.
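
For a variable-width encoding, the runtime check under discussion could
look roughly like the following.  This is a Python sketch that assumes
UTF-8 as the representation; `is_code_point_start` and `make_index` are
made-up names, not a proposed Mercury API.

```python
def is_code_point_start(buf, offset):
    """True if offset is in range and not in the middle of a UTF-8 sequence."""
    # Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.
    return 0 <= offset < len(buf) and (buf[offset] & 0xC0) != 0x80

def make_index(buf, offset):
    """Checked (string, int) -> index conversion; raises on a bad offset."""
    if not is_code_point_start(buf, offset):
        raise ValueError("offset %d is not the start of a code point" % offset)
    return offset

s1 = "ab".encode("utf-8")        # b'ab'
s2 = "\u00e9".encode("utf-8")    # b'\xc3\xa9', one two-byte code point

idx = make_index(s1, 1)                 # fine for s1
print(is_code_point_start(s2, idx))     # False: same integer, different string
```

The last line is the case Matt describes: an index validated against one
string is not automatically valid for another, so the access function
would have to check again.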

> 
> > As I understand it, unicode is specifically designed to support these
> > operations efficiently.
> 
> Unicode isn't designed to support any operations efficiently.

From Section 2.2, "Unicode Design Principles":

    The Unicode Standard is designed to make efficient implementation
    possible ... All Unicode encoding forms are self-synchronizing
    and non-overlapping.  This makes randomly accessing and searching
    inside streams of characters efficient.
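
As a small illustration of those two properties, here is a Python sketch
for UTF-8 (the function name `sync_backward` is my own, and the input is
assumed to be valid UTF-8).

```python
def sync_backward(buf, offset):
    """Move offset back to the start of the code point that contains it."""
    # Self-synchronizing: continuation bytes (10xxxxxx) are recognizable on
    # their own, so in valid UTF-8 this loop runs at most three times.
    while offset > 0 and (buf[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset

buf = "a\u20acb".encode("utf-8")    # b'a\xe2\x82\xacb': 1 + 3 + 1 bytes

print(sync_backward(buf, 2))        # 1, start of the three-byte sequence
print(sync_backward(buf, 3))        # 1
print(sync_backward(buf, 4))        # 4, already on a boundary

# Non-overlapping: a plain byte search for an encoded character can only
# match at a code point boundary, never inside another character.
pos = buf.find("b".encode("utf-8"))
print(pos, sync_backward(buf, pos) == pos)    # 4 True
```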

Cheers,
Mark.
