[m-dev.] Re: Unicode escapes

Ian MacLarty maclarty at csse.unimelb.edu.au
Mon Mar 5 15:23:30 AEDT 2007

On Mon, Mar 05, 2007 at 11:48:01AM +1100, Simon Taylor wrote:
> On 05-Mar-2007, Ian MacLarty <maclarty at cs.mu.oz.au> wrote:
> > On Sun, Mar 04, 2007 at 09:05:56PM +1100, Simon Taylor wrote:
> > > A change last year introduced "\uXXXX" escapes for Unicode code points.
> > > The implementation of these for the C back-end is incorrect; it dumps
> > > UTF-8 encoded code points into strings without regard for the encoding
> > > of the source file being read, and the string library doesn't handle
> > > variable-length characters.
> > 
> > I don't really understand what the encoding of the source file has to do
> > with anything.  The whole idea is to make the internal representation of
> > strings independent of the source file encoding.
> 
> Mercury strings on the C backends aren't independent of the encoding
> of their source because io.m doesn't do any conversion when it reads
> strings in or writes them out.

But if you use \uXXXX in a string literal in your source file it will
be represented the same way no matter what the encoding of the source
file.  Granted, if you feed mmc a UTF-16 file, then I guess it will barf,
since it'll try and interpret each byte as a character, however I think
that's a different problem.  If, for example, you used \u00a9 in an
ISO 8859-1 file or a UTF-8 file, you'll get the same internal
representation.  If you used the \x... escapes, or just typed the
character in directly, then you'd probably get different
representations.
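
To make that concrete, here's a rough sketch in Java rather than Mercury
(the class and method names are made up for illustration, and it only
models the behaviour I described, it isn't the compiler's code).  It
assumes the escape is always stored as UTF-8, and that a character typed
in directly is stored as whatever bytes the editor wrote:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EscapeDemo {
    // Hex dump of a byte array, e.g. "c2 a9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Assumed model of the \u00a9 escape: the code point is encoded
    // as UTF-8, whatever the encoding of the source file.
    static byte[] fromEscape() {
        String s = new String(Character.toChars(0x00a9));
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Assumed model of typing the character in directly: the bytes
    // the editor saved are copied through unchanged.
    static byte[] typedIn(Charset sourceEncoding) {
        return "\u00a9".getBytes(sourceEncoding);
    }

    public static void main(String[] args) {
        // prints "c2 a9"
        System.out.println(hex(fromEscape()));
        // prints "a9"
        System.out.println(hex(typedIn(StandardCharsets.ISO_8859_1)));
        // prints "c2 a9"
        System.out.println(hex(typedIn(StandardCharsets.UTF_8)));
    }
}
```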

> Compare the output of the two programs below when executed on a
> system for which UTF-8 is not the default encoding for text files.
> The Mercury program generates gibberish, the Java program produces
> a single '©' character.

Are you sure it's gibberish?  Won't it be the UTF-8 encoding of the
copyright sign?  If you redirect the output to a file and open it in an
application that understands UTF-8, then don't you get a copyright sign?
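
Here's a small Java sketch of what I mean (the class name is invented
for illustration, and I'm assuming the program's output is the two bytes
of the UTF-8 encoding of U+00A9).  The same bytes read as UTF-8 give
the copyright sign, and read as ISO 8859-1 give two characters:

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class ViewerDemo {
    // Assumed program output: the UTF-8 encoding of U+00A9.
    static final byte[] OUTPUT = { (byte) 0xc2, (byte) 0xa9 };

    // How a UTF-8 aware application would read the bytes.
    static String asUtf8() {
        return new String(OUTPUT, StandardCharsets.UTF_8);
    }

    // How an ISO 8859-1 application would read the same bytes.
    static String asLatin1() {
        return new String(OUTPUT, StandardCharsets.ISO_8859_1);
    }

    // List the code points, so the result does not depend on the
    // encoding of the console this is run in.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(describe(asUtf8()));   // prints "U+00A9"
        System.out.println(describe(asLatin1())); // prints "U+00C2 U+00A9"
    }
}
```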

> copyright.m
> ===========
> :- module copyright.
> :- interface.
> :- import_module io.
> :- pred main(io::di, io::uo) is det.
> :- implementation.
> main(!IO) :-
> 	% U+00A9 is the copyright symbol.
> 	io.write_string("\u00a9\n", !IO).
> 
> copyright.java
> ==============
> class Copyright {
> 	public static void main(String[] args)  {
> 		System.out.println("\u00a9");
> 	}
> }
> 
> > > Also, string.to_upper and string.to_lower are broken even on the
> > > Unicode-aware Java and .NET backends, because in some languages (e.g.
> > > German and Greek) case conversion can cause one code point to become
> > > multiple code points or vice versa.  The correct approach is to convert
> > > the entire string as a unit.  Any code which deals with individual code
> > > points rather than strings in Unicode is probably wrong, because of
> > > things like combining diacriticals and surrogate code points.
> > 
> > What code are you thinking of that deals with individual code points?
> > There are no standard library functions to do operations on individual
> > unicode characters at the moment.
> 
> string.to_upper converts the string to a list of characters, then calls
> char.to_upper on each character, which on the Java and .NET backends is
> a (possibly surrogate) code point.
> This could be considered a documentation problem; it hasn't been updated
> to make clear that {string,char}.to_upper only work on unaccented Latin
> letters, even on backends for which char represents a Unicode code point.

Yes, I think the documentation needs updating.
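
For the record, the difference between the two approaches is easy to
see with the German sharp s.  This is a Java sketch using Java's own
library calls, not Mercury's string.to_upper; the class and method
names are just for illustration:

```java
import java.util.Locale;

class CaseDemo {
    // Convert one char at a time, in the style of mapping a
    // per-character to_upper over the string.
    static String perChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Convert the string as a unit, which allows one code point to
    // become several.
    static String wholeString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String word = "stra\u00dfe";
        // The sharp s has no single-character upper case form, so it
        // is left alone here.
        System.out.println(perChar(word).equals("STRA\u00dfE")); // prints "true"
        // Here it becomes the two letters "SS".
        System.out.println(wholeString(word));                   // prints "STRASSE"
    }
}
```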

> > > The NEWS file and reference manual imply that Mercury stores strings
> > > internally as UTF-8 encoded Unicode, when it really doesn't for the
> > > C backends.  You might get away with it if all your text files are
> > > UTF-8 encoded and you don't mess with the contents of strings too
> > > much, but that's not really good enough for a documented language
> > > feature.
> > > 
> > > I'm not suggesting the removal of Unicode escapes if MC has a use for
> > > them in their current state, but they shouldn't be documented until
> > > Unicode support is more fully implemented.
> > 
> > What do you suggest needs to be added to make unicode support "more
> > fully implemented"?  I was intending adding some more unicode support
> > functions at some stage, for example a foldl over all the unicode
> > characters in a string and a unicode_length function for determining how
> > many unicode characters are in a string.
> 
> * Input and output conversion.

Perhaps what we need are some stream instances for doing this.  For
example if you write to a UTF-16 stream, then the internal UTF-8
representation will be converted to UTF-16.
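
Roughly what I have in mind, sketched in Java since it already has this
kind of stream (this is not a proposed Mercury interface, and the class
and method names are hypothetical).  The internal bytes are assumed to
be UTF-8 and get re-encoded for whatever the target stream wants:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TranscodeDemo {
    // Decode the (assumed UTF-8) internal representation and write it
    // to a stream that encodes in the target charset.
    static byte[] writeAs(byte[] internalUtf8, Charset target) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, target)) {
            out.write(new String(internalUtf8, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+00A9.
        byte[] internal = { (byte) 0xc2, (byte) 0xa9 };
        // Big-endian UTF-16 without a byte order mark: 00 a9.
        System.out.println(Arrays.toString(
            writeAs(internal, StandardCharsets.UTF_16BE))); // prints "[0, -87]"
        // ISO 8859-1: the single byte a9.
        System.out.println(Arrays.toString(
            writeAs(internal, StandardCharsets.ISO_8859_1))); // prints "[-87]"
    }
}
```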

> * `char' would need to represent a code point, not a byte in the UTF-8
>   representation, which would require changes in functions like string.index.

I was intending to add a separate code_point type for that kind of
thing.  I think modifying the meaning of char would be too big a change.
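
As a sketch of the sort of operations I mean, here is a code point count
and a left fold over the code points of a UTF-8 byte string, written in
Java.  The names unicodeLength and foldl are placeholders, not an
agreed interface, and the code assumes the input is well-formed UTF-8
(it does no validation):

```java
import java.nio.charset.StandardCharsets;
import java.util.function.IntBinaryOperator;

class CodePoints {
    // Count code points by counting every byte that is not a
    // continuation byte (10xxxxxx).
    static int unicodeLength(byte[] utf8) {
        int n = 0;
        for (byte b : utf8) {
            if ((b & 0xc0) != 0x80) n++;
        }
        return n;
    }

    // Left fold of f over the decoded code points.
    static int foldl(IntBinaryOperator f, byte[] utf8, int acc) {
        int i = 0;
        while (i < utf8.length) {
            int lead = utf8[i] & 0xff;
            // Sequence length from the lead byte.
            int len = lead < 0x80 ? 1 : lead < 0xe0 ? 2 : lead < 0xf0 ? 3 : 4;
            // Payload bits of the lead byte.
            int cp = len == 1 ? lead : lead & (0x3f >> (len - 1));
            // Each continuation byte contributes six more bits.
            for (int k = 1; k < len; k++) {
                cp = (cp << 6) | (utf8[i + k] & 0x3f);
            }
            acc = f.applyAsInt(acc, cp);
            i += len;
        }
        return acc;
    }

    public static void main(String[] args) {
        // 'a', U+00A9 and U+1D11E: 1 + 2 + 4 = 7 bytes in UTF-8.
        byte[] s = "a\u00a9\ud834\udd1e".getBytes(StandardCharsets.UTF_8);
        System.out.println(s.length);         // prints "7"
        System.out.println(unicodeLength(s)); // prints "3"
        // Largest code point in the string.
        System.out.println(Integer.toHexString(foldl(Math::max, s, 0))); // prints "1d11e"
    }
}
```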

> * Functions like string.to_upper should do the right thing for languages
>   other than English.

I think maybe the problem is lack of clarity in the documentation.  Char
is not supposed to represent a unicode code point.  It represents a code
unit only.

The problem the \uXXXX escape sequence is trying to solve is how to
allow users to use unicode characters in string literals in a way
that's independent of the encoding of the source file.  Do you not agree
that using \uXXXX is preferable to using \x... or typing in the
character directly and relying on the encoding used by your editor?

> The best approach would probably be to make string.m use the ICU library
> <http://icu.sourceforge.net>, but that would be a lot of work.

That seems like a good idea (maybe it would be a unicode module and not
the string module?).

If you insist I can remove the \uXXXX escape sequence.

