[m-dev.] Re: Unicode escapes

Simon Taylor staylr at gmail.com
Tue Mar 6 10:41:03 AEDT 2007


On 05-Mar-2007, Ian MacLarty <maclarty at csse.unimelb.edu.au> wrote:
> But if you use \uXXXX in a string literal in your source file it will
> be represented the same way no matter what the encoding of the source
> file.  Granted, if you feed mmc a UTF-16 file, then I guess it will barf,
> since it'll try and interpret each byte as a character, however I think
> that's a different problem.  If, for example, you used \u00a9 in an
> ISO 8859-1 file or a UTF-8 file, you'll get the same internal
> representation.  If you used the \x... escapes, or just typed the
> character in directly, then you'd probably get different
> representations.

I'm not complaining about it having the same internal representation.
I'm complaining about it being given the same external representation.

It doesn't make sense to talk about putting the character '\u00a9' in an
ISO 8859-1 file or a UTF-8 file.  An encoding of that character is put
in the file which will differ depending on the target character set and
encoding.  The encoding will fail if the character doesn't have a
representation in the target character set.  The default character set
and encoding are set by the POSIX locale, and that is what text-based
programs are expected to read and write unless the locale is explicitly
overridden.
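
As a rough illustration of that distinction, here is a minimal sketch in
Python rather than Mercury, chosen only because its codecs make the
behaviour easy to show. The variable names are made up for the example.

```python
# One abstract character, U+00A9 COPYRIGHT SIGN.
copyright_sign = "\u00a9"

# Its external representation depends on the target encoding.
latin1_bytes = copyright_sign.encode("iso-8859-1")  # one byte: 0xA9
utf8_bytes = copyright_sign.encode("utf-8")         # two bytes: 0xC2 0xA9

# Encoding fails when the target character set lacks the character.
try:
    copyright_sign.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The character is the same in all three cases; only the bytes written to
the file differ, and for ASCII there are no bytes to write at all.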

> > Compare the output of the two programs below when executed on a
> > system for which UTF-8 is not the default encoding for text files.
> > The Mercury program generates gibberish, the Java program produces
> > a single '@' character.
> 
> Are you sure it's gibberish?  Won't it be the UTF-8 encoding of the
> copyright sign?  If you redirect the output to a file and open it in an
> application that understands UTF-8, then don't you get a copyright sign?

It's gibberish to any application that expects to read the system's
default text encoding.
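
A small Python sketch of what such an application sees, assuming for the
sake of the example that the system default is ISO 8859-1:

```python
# Bytes a program emits if it always writes UTF-8.
utf8_output = "\u00a9".encode("utf-8")

# What a reader expecting ISO 8859-1 makes of those same bytes:
# two characters (A-circumflex, then the copyright sign), not one.
seen_by_latin1_reader = utf8_output.decode("iso-8859-1")
```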

> > * Input and output conversion.
> 
> Perhaps what we need are some stream instances for doing this.  For
> example if you write to a UTF-16 stream, then the internal UTF-8
> representation will be converted to UTF-16.

But what is the default input and output representation?

We could take the C approach of making people read in the bytes and then
explicitly put a stream on top of that to convert it into a Unicode
encoding, but who wants to see the raw bytes?  It's another step for the
user that makes all this Unicode stuff seem too hard.

I would suggest adding a variant of io.open_input which takes an
encoding to override the default, and io.open_text_as_binary_input which
would give access to the bytes but still perform newline conversion.
Corresponding predicates would be needed for output streams.
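
The shape of the suggestion can be sketched in Python. The helper names
open_input and read_text_as_binary below are hypothetical analogues made
up for the example. They are not Mercury library predicates, and the
sketch covers only the input side.

```python
import locale
import os
import tempfile

def open_input(path, encoding=None):
    """Open a text file, using the locale's encoding unless overridden."""
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return open(path, "r", encoding=encoding)

def read_text_as_binary(path):
    """Return the raw bytes of a file, converting only CRLF newlines."""
    with open(path, "rb") as f:
        return f.read().replace(b"\r\n", b"\n")

# Write a small ISO 8859-1 file with a CRLF line ending to read back.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("\u00a9 2007\r\n".encode("iso-8859-1"))

with open_input(path, encoding="iso-8859-1") as f:
    text = f.read()              # decoded characters
raw = read_text_as_binary(path)  # undecoded bytes, newline converted
os.remove(path)
```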

> > * `char' would need to represent a code point, not a byte in the UTF-8
> >   representation, which would require changes in functions like string.index.
 
> I was intending to add a separate code_point type for that kind of
> thing.  I think modifying the meaning of char would be too big a change.

`char' is already used in that sense for the .NET and Java backends. 
 
> > * Functions like string.to_upper should do the right thing for languages
> >   other than English.
> 
> I think maybe the problem is lack of clarity in the documentation.  Char
> is not supposed to represent a unicode code point.  It represents a code
> unit only.

We completely disagree on what char means.  A char is an entity that has
some useful testable properties.  It doesn't make sense to call
is_whitespace or is_alpha on a code unit (for example a byte in the
UTF-8 representation of a character), because a code unit is an internal
representation issue that users of a string type shouldn't even see.
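
A short Python sketch of why character tests on code units are
meaningless, using U+00E9 (e with acute accent) as an arbitrary example:

```python
e_acute = "\u00e9"

# The character is alphabetic.
char_is_alpha = e_acute.isalpha()

# Its UTF-8 code units are the bytes 0xC3 and 0xA9.
code_units = e_acute.encode("utf-8")

# Neither byte on its own is a letter; bytes.isalpha() only knows ASCII.
units_are_alpha = [bytes([unit]).isalpha() for unit in code_units]
```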

I think Unicode code points are a little too low level as well.  The
behaviour of some Unicode code points (such as combining characters)
is context dependent to the point that it might be better to dump the
character type entirely.  Instead you would have an abstract
`type char_index == pair(string, index)' where the index points to
the first code point of a character and its possible surrogate code
point and combining character code points.  There would be semidet
functions `next' and `previous' which jump to the first code point
of the next or previous character in the string.
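
A very rough Python sketch of that idea follows. It is an approximation
only: it treats a character as a base code point followed by any code
points with a non-zero combining class, which is much cruder than the
Unicode grapheme cluster rules. Python strings are indexed by code point,
so surrogates do not appear here. The names next_char and previous_char
are made up for the example.

```python
import unicodedata

def next_char(s, i):
    """Index of the first code point of the character after the one at i,
    or None if there is no next character."""
    j = i + 1
    # Skip combining marks attached to the character starting at i.
    while j < len(s) and unicodedata.combining(s[j]) != 0:
        j += 1
    return j if j < len(s) else None

def previous_char(s, i):
    """Index of the first code point of the character before the one at i,
    or None if there is no previous character."""
    j = i - 1
    # Walk back over combining marks to reach their base code point.
    while j > 0 and unicodedata.combining(s[j]) != 0:
        j -= 1
    return j if j >= 0 else None

# "e" + COMBINING ACUTE ACCENT + "a": three code points, two characters.
s = "e\u0301a"
second = next_char(s, 0)
first = previous_char(s, second)
```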

But I don't know enough about oddball writing systems to design this
properly, so Unicode code points are probably the best we can do.

> The problem the \uXXXX escape sequence is trying to solve is how to
> allow users to use unicode characters in string literals in a way
> that's independent of the encoding of the source file. 

Unicode escapes aren't the problem.  The problem is that Unicode escapes
are a small part of the process of making Mercury understand Unicode, and
they have been added and documented without the rest of the changes
needed to make them usable except in a very restricted set of
circumstances.  This gives a user reading the reference manual and NEWS
file the impression that Mercury understands Unicode, when it really
doesn't.

If you haven't already, the Plan 9 UTF-8 paper is worth a read:
<http://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf>.

> Do you not agree that using \uXXXX is preferable to using \x... or
> typing in the character directly and relying on the encoding used
> by your editor?

Creating a random jumble of UTF-8 and an 8-bit character set like we
do now is far, far worse.

> > The best approach would probably be to make string.m use the ICU library
> > <http://icu.sourceforge.net>, but that would be a lot of work.
> 
> That seems like a good idea (maybe it would be a unicode module and not
> the string module?).

It's 2007.  Code that doesn't handle Unicode strings is broken.
Mercury only does things this way because Unicode wasn't in a usable
state back in 1993.  We could supply a `cstring' module so that people
can choose to do things the old way if they have a good reason, but
it shouldn't be the default.

Simon.