[mercury-users] Character type class
Richard A. O'Keefe
ok at hermes.otago.ac.nz
Thu Jan 27 09:57:32 AEDT 2000
What do people want to see in a character type class?
- comparisons and orderings with respect to locale
I've got a copy of the draft ISO standard for string ordering;
perhaps someone on this list knows if it ever became an official
standard. With the tables and examples, it ran to 150 pages.
The central point about locale-dependent *string* comparison is
that it is not simply composed from any *character* comparison.
Take the ordering for English, for example:
  Pass 1 (left to right):
      ignore everything except letters, and ignore accents and
      alphabetic case. Ligatures such as AE, OE, IJ, DJ, ...
      are split A E, O E, I J, D J, ...
  Pass 2 (left to right):
      same thing, but this time heed accents.
  Pass 3 (left to right):
      same thing, but this time heed accents and case.
  Pass 4 (left to right):
      simple lexicographic order
although it could be useful to split this last pass into
  Pass 4 (left to right):
      ignore leading and trailing layout, compress internal
      runs of layout to single spaces, and then do simple
      lexicographic order
  Pass 5 (left to right):
      simple lexicographic order
Except for the ligature splitting, it's possible to do this in a single
pass using some auxiliary tables (I have C code to do it). But it is
NOT a composition of any simple ordering on characters.
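
To make the pass structure concrete, here is a minimal Python sketch. It is not the C code mentioned above, and it does not compare in a single pass: it builds one sort-key component per pass and lets tuple comparison fall through from pass 1 to pass 4. The name `english_key`, the tiny `LIGATURES` table, and the choice that upper case sorts before lower case in pass 3 are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny ligature table (a real one would be larger).
LIGATURES = {
    "\u00c6": "AE", "\u00e6": "ae",   # AE ligature
    "\u0152": "OE", "\u0153": "oe",   # OE ligature
    "\u0132": "IJ", "\u0133": "ij",   # IJ ligature
}

def english_key(s):
    """Build a sort key whose four components correspond to passes 1-4."""
    base, accents, case = [], [], []
    prev_is_letter = False
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # A combining mark belongs to the letter just before it, if any.
            if prev_is_letter:
                accents[-1] += ch
            continue
        prev_is_letter = False
        for part in LIGATURES.get(ch, ch):
            if part.isalpha():
                base.append(part.lower())      # pass 1: letters only, no case
                accents.append("")             # pass 2: accents per letter
                case.append(part.islower())    # pass 3: assumed upper first
                prev_is_letter = True
    # Pass 4 is plain code-point order on the original string.
    return (tuple(base), tuple(accents), tuple(case), s)
```

With this key, `sorted(["Elephant", "eggplant"], key=english_key)` puts "eggplant" first, because pass 1 already decides on "g" versus "l" before case is ever consulted. Note that the key is computed from the whole string; no per-character comparison function appears anywhere.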
It gets worse: French uses much the same scheme, but runs Pass 2
from right to left (I *don't* have C code to do that in one pass).
Other languages may use up to seven passes.
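
The effect of running the accent pass from right to left can be seen on a handful of words. The sketch below is an illustration only: the helper names are made up, case and the later passes are left out, and it sidesteps the one-pass problem by building the whole accent sequence first and then reversing it.

```python
import unicodedata

def base_and_accents(s):
    """Return (base letters, per-letter accent strings), ignoring non-letters."""
    base, marks = [], []
    prev_is_letter = False
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            if prev_is_letter:
                marks[-1] += ch
        else:
            prev_is_letter = ch.isalpha()
            if prev_is_letter:
                base.append(ch.lower())
                marks.append("")
    return tuple(base), tuple(marks)

def english_style_key(s):
    # Pass 2 scans the accents left to right.
    base, marks = base_and_accents(s)
    return (base, marks)

def french_style_key(s):
    # Pass 2 scans the accents right to left, so the last accent decides first.
    base, marks = base_and_accents(s)
    return (base, marks[::-1])

# cote, c(o-circumflex)t(e-acute), c(o-circumflex)te, cot(e-acute)
words = ["cote", "c\u00f4t\u00e9", "c\u00f4te", "cot\u00e9"]
print(sorted(words, key=english_style_key))
print(sorted(words, key=french_style_key))
```

All four words tie in pass 1 (c, o, t, e), so the accent pass decides. Scanning left to right, the circumflex on "o" matters first; scanning right to left, the acute on the final "e" matters first, which swaps the two middle words.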
For the specific case of Maaori, "ng" and "wh" are multiple
_characters_ but count as single _letters_.
"h" precedes "i" but "wini" precedes "whai". There are other
writing systems that do the same kind of thing; Spanish used to
do it with "ch" and "ll", although I think they officially changed that.
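
A small sketch of the "wini" versus "whai" case, assuming the usual Maaori alphabet order a, e, h, i, k, m, n, ng, o, p, r, t, u, w, wh (that table, and the function names, are assumptions added here; vowel length is ignored). The string is first cut into letters, with "ng" and "wh" taken as units, and only then compared.

```python
# Assumed alphabet order, with "ng" and "wh" as single letters.
MAAORI_LETTERS = ["a", "e", "h", "i", "k", "m", "n", "ng",
                  "o", "p", "r", "t", "u", "w", "wh"]
RANK = {letter: i for i, letter in enumerate(MAAORI_LETTERS)}

def maaori_letters(word):
    """Split a word into letters, treating "ng" and "wh" as one letter each."""
    w = word.lower()
    letters = []
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if pair in ("ng", "wh"):
            letters.append(pair)
            i += 2
        else:
            letters.append(w[i])
            i += 1
    return letters

def maaori_key(word):
    # Characters outside the table sort after every known letter.
    return [RANK.get(letter, len(RANK)) for letter in maaori_letters(word)]
```

Here `maaori_key("wini") < maaori_key("whai")` holds because "w" precedes "wh", whereas plain character-by-character comparison gives `"whai" < "wini"` because "h" precedes "i". The tokenizing step needs to see two characters at once, which is exactly what a per-character ordering cannot express.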
The point is that I don't think it's worthwhile trying to provide any
kind of locale-dependent ordering on characters. Just say that the
character type represents (possibly some subset of) ISO 10646.
More than that: the Maaori and Spanish examples demonstrate that it
is important NOT to provide a locale-dependent ordering on characters,
lest it mislead someone into thinking that it can be "lifted" to
provide a useful ordering on strings when in fact it cannot.
Never mind Maaori and Spanish. What about English? It's clear that
"E" < "e" by the rules above, but "eggplant" < "Elephant".
Instead, provide locale-dependent *string* ordering (the ISO draft
standard I mentioned above has several recommended interfaces).
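
To see why "lifting" fails on the English example, consider the obvious way to turn a character ordering into a string ordering. The sketch below is an illustration with made-up names; `char_less` is a stand-in that uses plain code-point order, in which "E" < "e".

```python
def lifted_less(a, b, char_less):
    """Lexicographic lift of a character ordering to strings."""
    for x, y in zip(a, b):
        if char_less(x, y):
            return True
        if char_less(y, x):
            return False
    # One string is a prefix of the other: the shorter one comes first.
    return len(a) < len(b)

def char_less(x, y):
    # Stand-in character ordering in which "E" < "e" (code-point order).
    return x < y

print(lifted_less("E", "e", char_less))                # True
print(lifted_less("Elephant", "eggplant", char_less))  # True
```

The lifted ordering is decided at the first differing character, so any character ordering with "E" < "e" must put "Elephant" before "eggplant", the opposite of what the multi-pass rules give.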