[mercury-users] Character type class

Richard A. O'Keefe ok at hermes.otago.ac.nz
Wed Feb 2 09:58:12 AEDT 2000


Concerning Unicode and comparison, I note that there's a newish Unicode
Technical Report about a method for compressing Unicode strings.  Of
direct relevance to many programming tasks, it is aimed at compressing
*short* strings usefully, without requiring the long start-up of things
like LZW and adaptive Huffman.  There are 8 fixed 'windows' of 128 characters
and 8 dynamic windows (that can be set by byte codes); single byte codes are
used to select the appropriate window and bytes 16'80 to 16'FF are mapped
according to the current window, while CR, LF, TAB, and ' ' to DEL are left
as is.  The net effect is that strings of ISO Latin 1 printable text are
left unchanged, while strings of text in other members of ISO 8859 pay one
byte (or possibly two) of overhead to set the appropriate window and then
pay one byte per character.  Even Japanese benefits from this scheme, because
runs of kana are shortened.

As a compact representation for text in a large alphabet (yes it does handle
Plane 1 characters as well as BMP characters), there's a lot to like about
it, except of course that it defines an encoding for compression, not a
compression algorithm, so you'd have to ensure that all strings were
compressed the same way.

If I've understood it correctly, the UTF-8 encoding preserves
lexicographic order.  That is,
	s1 <= s2 if and only if UCS2_to_UTF8(s1) <= UCS2_to_UTF8(s2).
But the new compression scheme does NOT preserve lexicographic order.

So there is a definite reason to have a fast string comparison method
which is a total order over strings, but which is NOT lexicographic order
on the (UCS2 or UCS4) sequences they represent.

In particular, "a" > "bγ" (reading &gammar; as a single Greek
character), because the byte sequences would be
	(a) and (select greek window; b; shifted gamma)
and all the compression codes are less than space.

For what it's worth, people in New Zealand _would_ notice the difference.
¯ is U+0101, so 
"Mäori" would be (02,4d,81,6f,72,69)
                      SQ2    ^^        16'(0101 - 0100 + 80)
which would compare LESS than "A" (41).

Storage may be cheap, but it isn't free, and compact strings plus fast
comparison for internal structural use would be worth the price of
scrambling the M\=aori collation order provided there were a *separate*
locale-sensitive means of comparison.
--------------------------------------------------------------------------
mercury-users mailing list
post:  mercury-users at cs.mu.oz.au
administrative address: owner-mercury-users at cs.mu.oz.au
unsubscribe: Address: mercury-users-request at cs.mu.oz.au Message: unsubscribe
subscribe:   Address: mercury-users-request at cs.mu.oz.au Message: subscribe
--------------------------------------------------------------------------



More information about the users mailing list