[m-rev.] for review: improve unicode support

Peter Wang novalazy at gmail.com
Mon Apr 4 17:16:03 AEST 2011


On 2011-04-01, Julien Fischer <juliensf at csse.unimelb.edu.au> wrote:
> 
> Hi Peter,
> 
> Thanks for all your work on this!
> 
> On Mon, 28 Mar 2011, Peter Wang wrote:
> 
> >Declare that we use the Unicode character set, and UTF-8 or UTF-16 for the
> >internal string representation (depending on the backend).  User code may be
> >written to those assumptions.  Other external encodings can be supported in
> >the future by translating to/from Unicode internally.
> >
> >The `char' type now represents a Unicode code point.
> >
> >NOTE: questions about how to handle unpaired surrogate code points, etc. have
> >been left for later.
> 
> ...
> 
> >diff --git a/runtime/mercury_string.c b/runtime/mercury_string.c
> >index 0f20219..e7ca64e 100644
> >--- a/runtime/mercury_string.c
> >+++ b/runtime/mercury_string.c

> >+
> >+size_t
> >+MR_utf8_encode(char s_[], int c)
> >+{
> >+    unsigned char *s = (unsigned char *)s_;
> >+    unsigned int uc = c;
> 
> Shouldn't the second argument have type MR_Char?
> (And likewise in the function prototype for this.)

Fixed.

> >diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h
> >index b12afbe..0a99f2d 100644
> >--- a/runtime/mercury_string.h
> >+++ b/runtime/mercury_string.h
> >@@ -15,24 +15,21 @@
> >#include <stdarg.h>
> >
> >/*
> >-** Mercury characters are given type `MR_Char', which is a typedef for `char'.
> >-** But BEWARE: when stored in an MR_Integer, the value must be
> >-** first cast to `MR_UnsignedChar'.
> >-** Mercury strings are stored as pointers to '\0'-terminated arrays of MR_Char.
> >+** Mercury characters (Unicode code points) are given type `MR_Char', which is
> >+** a typedef for `int'.
> 
> The assumption here being that int is a 32-bit quantity?  (That's almost
> certainly true for anything Mercury currently runs on -- it's probably
> worth documenting that assumption here though.)

I changed the typedefs to MR_[u]int_least32_t.

> 
> In conclusion, a couple of things:
> - you should add an entry to the NEWS file concerning the addition of
>   Unicode support

Done.

> - does tabling of characters work correctly?

Yes.

> 
> Otherwise, that looks fine -- please bootcheck it on a number of
> platforms, in a number of grdes before committing though.

Committed.  Phew.

Peter
--------------------------------------------------------------------------
mercury-reviews mailing list
Post messages to:       mercury-reviews at csse.unimelb.edu.au
Administrative Queries: owner-mercury-reviews at csse.unimelb.edu.au
Subscriptions:          mercury-reviews-request at csse.unimelb.edu.au
--------------------------------------------------------------------------



More information about the reviews mailing list