[m-dev.] Unicode escapes

Simon Taylor staylr at gmail.com
Sun Mar 4 21:05:56 AEDT 2007


Hi,

A change last year introduced "\uXXXX" escapes for Unicode code points.
The implementation of these for the C backend is incorrect: it dumps
UTF-8 encoded code points into strings without regard for the encoding
of the source file being read, and the string library doesn't handle
variable-length characters.
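
A rough illustration of both problems, written in Python rather than
Mercury and meant only as a sketch (the variable names are hypothetical,
and Latin-1 is just an assumed example of a non-UTF-8 source encoding):
the escape "\u00e9" is one code point but two bytes in UTF-8, so a
byte-oriented string library miscounts it, and the bytes are misread if
the surrounding file is not UTF-8.

```python
# Illustrative sketch only; this is not Mercury's implementation.
s = "\u00e9"                      # one code point: e with acute accent

utf8_bytes = s.encode("utf-8")    # the bytes that end up in the string
code_points = len(s)              # number of code points
byte_count = len(utf8_bytes)      # number of bytes; UTF-8 is variable-length

# Assumption: the rest of the source file is Latin-1. The same bytes are
# then interpreted as two unrelated characters.
misread = utf8_bytes.decode("latin-1")

print(code_points, byte_count, ascii(misread))
```

A library that treats each byte as a character would report the length
of this string as 2 and could split it in the middle of the character.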

Also, string.to_upper and string.to_lower are broken even on the
Unicode-aware Java and .NET backends, because in some languages (e.g.
German and Greek) case conversion can cause one code point to become
multiple code points or vice versa.  The correct approach is to convert
the entire string as a unit.  Any code which deals with individual code
points rather than strings in Unicode is probably wrong, because of
things like combining diacriticals and surrogate code points.
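
A small sketch of why per-code-point handling goes wrong, again in
Python and only as an illustration. The helper per_code_point_upper is
hypothetical: it stands in for any conversion that must map each code
point to exactly one code point.

```python
import unicodedata

def per_code_point_upper(text):
    """Hypothetical one-to-one conversion: a code point whose upper-case
    form is longer than one code point is left unchanged."""
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)

word = "stra\u00dfe"                    # German word containing sharp s
whole = word.upper()                    # converts the string as a unit
piecewise = per_code_point_upper(word)  # sharp s cannot be converted

# Combining diacritical: two code points, one user-visible character.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# A code point outside the Basic Multilingual Plane takes two UTF-16
# code units (a surrogate pair).
utf16_units = len("\U0001d11e".encode("utf-16-le")) // 2

print(ascii(whole), ascii(piecewise), len(decomposed), len(composed),
      utf16_units)
```

Converting the whole string grows it from six code points to seven,
which a character-to-character function cannot express.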

The NEWS file and reference manual imply that Mercury stores strings
internally as UTF-8 encoded Unicode, when it really doesn't for the
C backends.  You might get away with it if all your text files are
UTF-8 encoded and you don't mess with the contents of strings too
much, but that's not really good enough for a documented language
feature.

I'm not suggesting the removal of Unicode escapes if MC has a use for
them in their current state, but they shouldn't be documented until
Unicode support is more fully implemented.

Simon.