[m-rev.] for review: improve unicode support

Julien Fischer juliensf at csse.unimelb.edu.au
Fri Apr 1 10:16:23 AEDT 2011


Hi Peter,

Thanks for all your work on this!

On Mon, 28 Mar 2011, Peter Wang wrote:

> Declare that we use the Unicode character set, and UTF-8 or UTF-16 for the
> internal string representation (depending on the backend).  User code may be
> written to those assumptions.  Other external encodings can be supported in
> the future by translating to/from Unicode internally.
>
> The `char' type now represents a Unicode code point.
>
> NOTE: questions about how to handle unpaired surrogate code points, etc. have
> been left for later.

...

> diff --git a/runtime/mercury_string.c b/runtime/mercury_string.c
> index 0f20219..e7ca64e 100644
> --- a/runtime/mercury_string.c
> +++ b/runtime/mercury_string.c

> +int
> +MR_utf8_get(const MR_String s_, int pos)
> +{
> +    const unsigned char *s = (const unsigned char *)s_;
> +    int c;
> +    int remain;
> +    int minc;
> +    int i;
> +
> +    c = s[pos];
> +
> +    if (c <= 0x7F) {
> +        /* Plain ASCII (including NUL terminator). */
> +        return c;
> +    }
> +
> +    if (c <= 0xC1) {
> +        /* Trailing byte of multi-byte sequence or an overlong encoding for
> +         * code point <= 127.
> +         */
> +        return -2;
> +    }
> +
> +    if (c <= 0xDF) {
> +        /* 2-byte sequence. */
> +        c &= 0x1F;
> +        remain = 1;
> +        minc = 0x80;
> +    }
> +    else if (c <= 0xEF) {
> +        /* 3-byte sequence. */
> +        c &= 0x0F;
> +        remain = 2;
> +        minc = 0x800;
> +    }
> +    else if (c <= 0xF4) {
> +        /* 4-byte sequence. */
> +        c &= 0x07;
> +        remain = 3;
> +        minc = 0x10000;
> +    }
> +    else {
> +        /* Otherwise invalid. */
> +        return -2;
> +    }
> +
> +    for (i = 1; i <= remain; i++) {
> +        if (s[pos + i] == '\0') {
> +            return -2;
> +        }
> +    }
> +
> +    while (remain--) {
> +        int d = s[++pos];
> +
> +        if (!MR_utf8_is_trail_byte(d)) {
> +            return -2;
> +        }
> +
> +        c = (c << 6) | (d & 0x3F);
> +    }
> +
> +    /* Check for overlong forms, which could be used to bypass security
> +     * validations.  We could also check code points aren't above U+10FFFF or in
> +     * the surrogate ranges, but we don't.
> +     */
> +
> +    if (c < minc) {
> +        return -2;
> +    }
> +
> +    return c;
> +}
> +
> +int
> +MR_utf8_get_next(const MR_String s, int *pos)
> +{
> +    int c = MR_utf8_get(s, *pos);
> +
> +    if (c >= 0) {
> +        (*pos) += MR_utf8_width(c);
> +        return c;
> +    }
> +
> +    /* Some invalid byte sequence. */
> +    MR_utf8_next(s, pos);
> +    return c;
> +}
> +
> +int
> +MR_utf8_prev_get(const MR_String s, int *pos)
> +{
> +    if (MR_utf8_prev(s, pos)) {
> +        return MR_utf8_get(s, *pos);
> +    }
> +
> +    /* Past beginning. */
> +    return -1;
> +}
> +
> +
> +size_t
> +MR_utf8_width(int c)
> +{
> +    /* So we don't need to check for negative values nor use unsigned ints
> +     * in the interface, which are a pain.
> +     */
> +    unsigned int uc = c;
> +
> +    if (uc <= 0x7f) {
> +        return 1;
> +    }
> +    if (uc <= 0x7ff) {
> +        return 2;
> +    }
> +    if (uc <= 0xffff) {
> +        return (MR_is_surrogate(uc)) ? 0 : 3;
> +    }
> +    if (uc <= 0x10ffff) {
> +        return 4;
> +    }
> +    /* The rest are illegal. */
> +    return 0;
> +}
> +
> +size_t
> +MR_utf8_encode(char s_[], int c)
> +{
> +    unsigned char *s = (unsigned char *)s_;
> +    unsigned int uc = c;

Shouldn't the second argument have type MR_Char?
(And likewise in the function prototype for this.)

...

> diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h
> index b12afbe..0a99f2d 100644
> --- a/runtime/mercury_string.h
> +++ b/runtime/mercury_string.h
> @@ -15,24 +15,21 @@
> #include <stdarg.h>
>
> /*
> -** Mercury characters are given type `MR_Char', which is a typedef for `char'.
> -** But BEWARE: when stored in an MR_Integer, the value must be
> -** first cast to `MR_UnsignedChar'.
> -** Mercury strings are stored as pointers to '\0'-terminated arrays of MR_Char.
> +** Mercury characters (Unicode code points) are given type `MR_Char', which is
> +** a typedef for `int'.

The assumption here being that int is a 32-bit quantity?  (That's almost
certainly true for anything Mercury currently runs on -- it's probably
worth documenting that assumption here though.)

In conclusion, a couple of things:
- you should add an entry to the NEWS file concerning the addition of
   Unicode support
- does tabling of characters work correctly?

Otherwise, that looks fine -- please bootcheck it on a number of
platforms, in a number of grdes before committing though.

Julien.
--------------------------------------------------------------------------
mercury-reviews mailing list
Post messages to:       mercury-reviews at csse.unimelb.edu.au
Administrative Queries: owner-mercury-reviews at csse.unimelb.edu.au
Subscriptions:          mercury-reviews-request at csse.unimelb.edu.au
--------------------------------------------------------------------------



More information about the reviews mailing list