[mercury-users] Character type class

Richard A. O'Keefe ok at hermes.otago.ac.nz
Thu Jan 27 09:57:32 AEDT 2000


	What do people want to see in a character type class?
		- comparisons and orderings with respect to locale

I've got a copy of the draft ISO standard for string ordering;
perhaps someone on this list knows if it ever became an official
standard.  With the tables and examples, it ran to 150 pages.

The central point about locale-dependent *string* comparison is
that it is not simply composed from any *character* comparison.
Take the ordering for English, for example:

    Pass 1 (left to right):
	ignore everything except letters, and ignore accents and
	alphabetic case.  Ligatures such as AE, OE, IJ, DJ, ...
        are split A E, O E, I J, D J, ...

    Pass 2 (left to right):
	same thing, but this time heed accents.

    Pass 3 (left to right):
	same thing, but this time heed accents and case.

    Pass 4 (left to right):
	simple lexicographic order

although it could be useful to split this into

    Pass 4 (left to right):
	ignore leading and trailing layout, compress internal
	runs of layout to single spaces, and then do simple
	lexicographic order

    Pass 5 (left to right):
	simple lexicographic order

Except for the ligature splitting, it's possible to do this in a single
pass using some auxiliary tables (I have C code to do it).  But it is
NOT a composition of any simple ordering on characters.
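
To illustrate the passes above, here is a minimal sketch in Python (not the
C code mentioned, which is not shown here).  It builds one tuple-valued key
per string, with one component per pass, so that comparing keys compares
pass 1 first, then pass 2, and so on.  The name english_key, the small
LIGATURES table, and the choice to order accents by code point are all
assumptions made for the example; they are not taken from the draft
standard.

```python
import unicodedata

# Hypothetical ligature table; only a few entries, for illustration.
LIGATURES = {"Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe", "Ĳ": "IJ", "ĳ": "ij"}

def english_key(s):
    """Return a sort key with one component per pass described above."""
    # Split ligatures first, then decompose accented letters (NFD) so that
    # each accent becomes a separate combining mark after its base letter.
    text = "".join(LIGATURES.get(ch, ch) for ch in s)
    bases, accents, cases = [], [], []
    after_letter = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if after_letter:
                accents[-1] += ch          # attach the mark to its letter
        elif ch.isalpha():
            bases.append(ch.lower())       # pass 1: letters only, no case
            accents.append("")             # pass 2: accents (none so far)
            cases.append(ch.islower())     # pass 3: upper (False) first
            after_letter = True
        else:
            after_letter = False           # non-letters are ignored
    # Passes 2 and 3 only matter when the earlier components tie, so it is
    # enough to store the accents and the case flags on their own.
    # Pass 4 is the raw string in plain lexicographic order.
    return (tuple(bases), tuple(accents), tuple(cases), s)
```

With this key, sorted(words, key=english_key) puts "eggplant" before
"Elephant" while still putting "E" before "e", which no per-character
ordering lifted to strings can do.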

It gets worse:  French uses much the same scheme, but runs Pass 2
from right to left (I *don't* have C code to do that in one pass).
Other languages may use up to seven passes.
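
The right-to-left accent pass can be sketched with a two-level key in
Python, which sidesteps the one-pass problem by building the whole key
before comparing.  The function accent_key and its backward parameter are
hypothetical names for this example, and ordering accents by code point is
an assumption.

```python
import unicodedata

def accent_key(word, backward=False):
    """Two-level key: base letters, then accents (optionally reversed)."""
    bases, accents = [], []
    # NFD puts each accent after its base letter as a combining mark.
    # This sketch assumes the input consists of letters and accents only.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += ch
        elif ch.isalpha():
            bases.append(ch.lower())
            accents.append("")
    if backward:
        accents.reverse()      # compare accents from the end of the word
    return tuple(bases), tuple(accents)

words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=accent_key)
french = sorted(words, key=lambda w: accent_key(w, backward=True))
```

Here forward is cote, coté, côte, côté, while french is cote, côte, coté,
côté: the last accent difference decides rather than the first.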

For the specific case of Maaori, "ng" and "wh" are multiple
_characters_ but count as single _letters_.
"h" precedes "i" but "wini" precedes "whai", because "wh" is one
letter that sorts after "w".  There are other writing systems that
do the same kind of thing; Spanish used to do it with "ch" and "ll",
although I think they officially changed that.
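
A minimal sketch of the Maaori case in Python: split each word into
letters, treating "ng" and "wh" as single letters, and compare the
sequences of letter ranks.  The names MAORI_LETTERS, maori_letters and
maori_key are hypothetical, the alphabet order is assumed to be the usual
dictionary order, and vowel length is ignored for simplicity.

```python
# Assumed alphabet order, with the digraphs "ng" and "wh" as letters.
MAORI_LETTERS = ["a", "e", "h", "i", "k", "m", "n", "ng",
                 "o", "p", "r", "t", "u", "w", "wh"]
RANK = {letter: i for i, letter in enumerate(MAORI_LETTERS)}

def maori_letters(word):
    """Split a word into letters, taking "ng" and "wh" greedily."""
    word = word.lower()
    letters = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in ("ng", "wh"):
            letters.append(word[i:i + 2])
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters

def maori_key(word):
    """Sort key: ranks of the letters; unknown characters are skipped."""
    return tuple(RANK[l] for l in maori_letters(word) if l in RANK)
```

Sorting ["whai", "wini"] with key=maori_key gives "wini" first, whereas
plain character-by-character comparison gives "whai" first.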

The point is that I don't think it's worthwhile trying to provide any
kind of locale-dependent ordering on characters.  Just say that the
character type represents (possibly some subset of) ISO 10646.

More than that:  the Maaori and Spanish examples demonstrate that it
is important NOT to provide a locale-dependent ordering on characters,
lest it mislead someone into thinking that it can be "lifted" to
provide a useful ordering on strings when in fact it cannot.

Never mind Maaori and Spanish.  What about English?  It's clear that
"E" < "e" by the rules above, but "eggplant" < "Elephant".

Provide locale-dependent *string* ordering (the ISO draft standard I
mentioned above has several recommended interfaces).