[m-dev.] Re: Unicode escapes

Ian MacLarty maclarty at csse.unimelb.edu.au
Mon Mar 5 09:39:24 AEDT 2007


On Sun, Mar 04, 2007 at 09:05:56PM +1100, Simon Taylor wrote:
> Hi,
> 
> A change last year introduced "\uXXXX" escapes for Unicode code points.
> The implementation of these for the C back-end is incorrect; it dumps
> UTF-8 encoded code points into strings without regard for the encoding
> of the source file being read, and the string library doesn't handle
> variable-length characters.
> 

I don't really understand what the encoding of the source file has to do
with anything.  The whole idea is to make the internal representation of
strings independent of the source file encoding.
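
To make the two positions concrete, here is a minimal sketch in Python
(not Mercury, and purely illustrative).  It assumes, as the quoted report
describes, that the bytes of a literal are copied unchanged from the
source file while a \uXXXX escape is expanded to UTF-8.  The names
expand_escape and is_valid_utf8 are hypothetical helpers, not part of any
Mercury API.

```python
def expand_escape(code_point: int) -> bytes:
    # Model a \uXXXX escape: always emit the UTF-8 encoding of the code point.
    return chr(code_point).encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    # True if the byte string decodes cleanly as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


E_ACUTE = 0x00E9  # LATIN SMALL LETTER E WITH ACUTE

from_escape = expand_escape(E_ACUTE)              # two bytes: C3 A9
literal_latin1 = chr(E_ACUTE).encode("latin-1")   # one byte: E9
literal_utf8 = chr(E_ACUTE).encode("utf-8")       # two bytes: C3 A9

# Assumption: literal bytes are copied verbatim next to expanded escapes.
mixed_utf8 = literal_utf8 + from_escape
mixed_latin1 = literal_latin1 + from_escape

print(is_valid_utf8(mixed_utf8))    # True
print(is_valid_utf8(mixed_latin1))  # False
```

Under that assumption the escape on its own is independent of the source
encoding, but a string that mixes escapes with literal non-ASCII
characters is only consistent when the source file is itself UTF-8.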

> Also, string.to_upper and string.to_lower are broken even on the
> Unicode-aware Java and .NET backends, because in some languages (e.g.
> German and Greek) case conversion can cause one code point to become
> multiple code points or vice versa.  The correct approach is to convert
> the entire string as a unit.  Any code which deals with individual code
> points rather than strings in Unicode is probably wrong, because of
> things like combining diacriticals and surrogate code points.

What code are you thinking of that deals with individual code points?
There are no standard library functions that operate on individual
Unicode characters at the moment.
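
For reference, the case-conversion point quoted above can be shown with a
short Python sketch (Python rather than Mercury, for illustration only).
to_upper_per_char is a hypothetical stand-in for any conversion built
from a char -> char mapping; it is not an existing library function.

```python
import unicodedata


def to_upper_per_char(s: str) -> str:
    # Hypothetical char -> char conversion: every code point must map to
    # exactly one code point, so one-to-many mappings are left unchanged.
    out = []
    for c in s:
        u = c.upper()
        out.append(u if len(u) == 1 else c)
    return "".join(out)


word = "stra\u00dfe"                 # German "strasse" spelled with sharp s
whole = word.upper()                 # whole-string conversion
per_char = to_upper_per_char(word)   # per-code-point conversion

print(whole, len(word), len(whole))  # STRASSE 6 7
print(per_char == whole)             # False

# Combining diacriticals: one visible character, two code points.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
```

The whole-string conversion grows the string by one code point, which a
char -> char function cannot express.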

> 
> The NEWS file and reference manual imply that Mercury stores strings
> internally as UTF-8 encoded Unicode, when it really doesn't for the
> C backends.  You might get away with it if all your text files are
> UTF-8 encoded and you don't mess with the contents of strings too
> much, but that's not really good enough for a documented language
> feature.
> 
> I'm not suggesting the removal of Unicode escapes if MC has a use for
> them in their current state, but they shouldn't be documented until
> Unicode support is more fully implemented.
> 

What do you suggest needs to be added to make Unicode support "more
fully implemented"?  I was intending to add some more Unicode support
functions at some stage, for example a foldl over all the Unicode
characters in a string and a unicode_length function for determining how
many Unicode characters are in a string.
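
A rough sketch of what those two operations might look like, written in
Python over a UTF-8 byte string purely for illustration (it is not
Mercury code, and the name utf8_foldl, the argument order and the error
handling are all assumptions).  The decoder checks sequence length and
continuation bytes only; it does not reject overlong forms or surrogates.

```python
def utf8_foldl(f, data: bytes, acc):
    # Fold f(code_point, acc) over every code point in UTF-8 encoded data.
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            width, cp = 1, lead
        elif 0xC0 <= lead < 0xE0:
            width, cp = 2, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            width, cp = 3, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            width, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + width > n:
            raise ValueError("truncated sequence at offset %d" % i)
        for b in data[i + 1:i + width]:
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte after offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)
        acc = f(cp, acc)
        i += width
    return acc


def unicode_length(data: bytes) -> int:
    # Number of code points, as opposed to the number of bytes.
    return utf8_foldl(lambda _cp, count: count + 1, data, 0)


sample = "na\u00efve \u20ac\U0001d11e".encode("utf-8")
print(unicode_length(sample), len(sample))  # 8 14
```

Building unicode_length on top of the fold keeps the variable-length
decoding in one place, so other per-character operations could reuse it.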

Ian.