[m-dev.] Re: Unicode escapes

Simon Taylor staylr at gmail.com
Mon Mar 5 11:48:01 AEDT 2007

On 05-Mar-2007, Ian MacLarty <maclarty at cs.mu.oz.au> wrote:
> On Sun, Mar 04, 2007 at 09:05:56PM +1100, Simon Taylor wrote:
> > A change last year introduced "\uXXXX" escapes for Unicode code points.
> > The implementation of these for the C back-end is incorrect; it dumps
> > UTF-8 encoded code points into strings without regard for the encoding
> > of the source file being read, and the string library doesn't handle
> > variable-length characters.
> I don't really understand what the encoding of the source file has to do
> with anything.  The whole idea is to make the internal representation of
> strings independent of the source file encoding.

Mercury strings on the C backends aren't independent of the encoding
of their source because io.m doesn't do any conversion when it reads
strings in or writes them out.

Compare the output of the two programs below when executed on a
system for which UTF-8 is not the default encoding for text files.
The Mercury program generates gibberish; the Java program produces
a single '©' character.

:- module copyright.

:- interface.

:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.

main(!IO) :-
	% U+00A9 is the copyright symbol.
	io.write_string("\u00a9\n", !IO).

class Copyright {
	public static void main(String[] args)  {
		System.out.print("\u00a9\n");  // assumed body; the archived copy is cut off here
	}
}

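To make the "gibberish" concrete, here is a minimal sketch of what happens when the UTF-8 bytes for U+00A9 are written out unconverted and then interpreted as ISO-8859-1. The class and method names are hypothetical, and ISO-8859-1 is only an assumed example of a non-UTF-8 default encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of either program above.
class EncodingMismatch {
    // Encode the text as UTF-8, then decode the same bytes as ISO-8859-1.
    // This mimics raw UTF-8 output being displayed by a Latin-1 terminal.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String copyright = "\u00a9";
        byte[] utf8 = copyright.getBytes(StandardCharsets.UTF_8);
        // U+00A9 occupies two bytes in UTF-8: 0xC2 0xA9.
        System.out.println(utf8.length);
        // Read back as Latin-1, those two bytes become two characters,
        // U+00C2 followed by U+00A9.
        System.out.println(utf8ReadAsLatin1(copyright).length());
    }
}
```

Java avoids the problem because its output streams convert from the internal representation to the platform encoding on the way out, which is the conversion step io.m lacks.
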
> > Also, string.to_upper and string.to_lower are broken even on the
> > Unicode-aware Java and .NET backends, because in some languages (e.g.
> > German and Greek) case conversion can cause one code point to become
> > multiple code points or vice versa.  The correct approach is to convert
> > the entire string as a unit.  Any code which deals with individual code
> > points rather than strings in Unicode is probably wrong, because of
> > things like combining diacriticals and surrogate code points.
> What code are you thinking of that deals with individual code points?
> There are no standard library functions to do operations on individual
> unicode characters at the moment

string.to_upper converts the string to a list of characters, then calls
char.to_upper on each character, which on the Java and .NET backends is
a (possibly surrogate) code point.

This could be considered a documentation problem: the library
documentation hasn't been updated to make clear that
{string,char}.to_upper only work on unaccented Latin letters, even on
backends for which char represents a Unicode code point.
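
The difference between per-character and whole-string case conversion can be sketched in Java, using the German sharp s (U+00DF) as the example. The class and method names are hypothetical; upperPerChar only imitates the shape of mapping char.to_upper over a list of characters and is not the Mercury implementation.

```java
import java.util.Locale;

// Hypothetical demo class contrasting the two approaches.
class CaseMapping {
    // Convert one char at a time, analogous to calling char.to_upper
    // on each element of a character list.
    static String upperPerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Convert the string as a unit, so one code point may become several.
    static String upperWhole(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String word = "stra\u00dfe";
        // The sharp s has no single-character uppercase form, so it is
        // left unchanged here.
        System.out.println(upperPerChar(word));
        // Whole-string conversion expands it to "SS", lengthening the string.
        System.out.println(upperWhole(word));
    }
}
```

The per-character version cannot produce the right answer no matter how char.to_upper is defined, because its result type is a single character.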

> > The NEWS file and reference manual imply that Mercury stores strings
> > internally as UTF-8 encoded Unicode, when it really doesn't for the
> > C backends.  You might get away with it if all your text files are
> > UTF-8 encoded and you don't mess with the contents of strings too
> > much, but that's not really good enough for a documented language
> > feature.
> > 
> > I'm not suggesting the removal of Unicode escapes if MC has a use for
> > them in their current state, but they shouldn't be documented until
> > Unicode support is more fully implemented.
> What do you suggest needs to be added to make unicode support "more
> fully implemented"?  I was intending adding some more unicode support
> functions at some stage, for example a foldl over all the unicode
> characters in a string and a unicode_length function for determining how
> many unicode characters are in a string.

* Input and output conversion.
* `char' would need to represent a code point, not a byte in the UTF-8
  representation, which would require changes in functions like string.index.
* Functions like string.to_upper should do the right thing for languages
  other than English.
* etc.
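
As a rough illustration of the second point, the sketch below shows how the answer to "how long is this string?" depends on whether the unit is a UTF-8 byte, a UTF-16 code unit, or a code point. It is written in Java only because Java exposes all three views; the names unicodeLength and foldCodePoints are hypothetical and merely echo the functions proposed above.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.BiFunction;

// Hypothetical demo class; none of these names exist in string.m.
class CodePointLength {
    // Number of bytes in the UTF-8 encoding of the string.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points, counting a surrogate pair once.
    static int unicodeLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Left fold over the code points of the string.
    static <A> A foldCodePoints(String s, A init, BiFunction<A, Integer, A> f) {
        A acc = init;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            acc = f.apply(acc, cp);
            i += Character.charCount(cp);  // advance by 1 or 2 code units
        }
        return acc;
    }

    public static void main(String[] args) {
        // 'a', U+00A9, and U+1D11E (written as a surrogate pair).
        String s = "a\u00a9\uD834\uDD1E";
        System.out.println(utf8Bytes(s));      // 1 + 2 + 4 bytes
        System.out.println(s.length());        // 1 + 1 + 2 code units
        System.out.println(unicodeLength(s));  // 3 code points
    }
}
```

Indexing has the same problem: a function like string.index that counts bytes can land in the middle of a multi-byte character.
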

The best approach would probably be to make string.m use the ICU library
<http://icu.sourceforge.net>, but that would be a lot of work.
