[m-rev.] for review: add some unicode support to Mercury

Ian MacLarty maclarty at csse.unimelb.edu.au
Wed Jul 19 16:48:14 AEST 2006


Hi Peter,

Thanks for the review.  There will be a bit of a change to the
implementation after an in-person discussion I had with Julien.  Since
we'd like to also support the Java and IL backends, and these backends
use UTF-16 for strings, we will change the meaning of the \u and \U escape
sequences so that a Unicode character is inserted in the string using
the default encoding of the target backend (for the C backends this will
be UTF-8, while for IL and Java it'll be UTF-16).
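
As a rough illustration of what "default encoding of the target backend"
means for a single escape, here is a small sketch in Python (used only as
a stand-in, not Mercury, and not part of the diff).  The helper name
code_units is hypothetical.  It shows the UTF-8 bytes and the UTF-16 code
units that one code point would occupy:

```python
def code_units(code_point: int) -> dict:
    """Return the UTF-8 bytes and UTF-16 code units for one code point.

    Hypothetical helper for illustration; it does not model the Mercury
    compiler, only the two encodings mentioned above.
    """
    ch = chr(code_point)
    utf8 = list(ch.encode("utf-8"))
    raw16 = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    utf16 = [int.from_bytes(raw16[i:i + 2], "big")
             for i in range(0, len(raw16), 2)]
    return {"utf8": utf8, "utf16": utf16}


delta = code_units(0x0394)    # the code point written as \u0394
top = code_units(0x10FFFF)    # the code point written as \U0010FFFF

print([hex(b) for b in delta["utf8"]], [hex(u) for u in delta["utf16"]])
print([hex(b) for b in top["utf8"]], [hex(u) for u in top["utf16"]])
```

So the same escape would take two bytes under UTF-8 but one 16-bit unit
under UTF-16, and a code point above U+FFFF needs four bytes or a
surrogate pair respectively.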

This means I won't have utf8_length in the string module anymore.
Instead, I'll (in a subsequent diff) add a unicode module for dealing
with Unicode strings.

On Wed, Jul 19, 2006 at 04:01:57PM +1000, Peter Moulder wrote:
> On Wed, Jul 05, 2006 at 12:24:25AM +1000, Ian MacLarty wrote:
> 
> > +The sequence @samp{\x} introduces
> >  a hexadecimal escape; it must be followed by a sequence of hexadecimal
> >  digits and then a closing backslash.  It is replaced
> >  with the character whose character code is identified by the hexadecimal
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> I suggest changing to `byte whose value', to clarify that e.g. \xa0\ is
> replaced by just one byte rather than being equivalent to \u00a0.
> 

Yes, okay.
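
The distinction Peter describes can be sketched as follows, assuming a
UTF-8 string representation as on the C backends.  This is Python used
purely for illustration; the Mercury escape \xa0\ is modelled here as a
single raw byte, and is_valid_utf8 is a hypothetical helper:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw_byte = bytes([0xA0])             # models \xa0\ : exactly one byte
encoded = "\u00a0".encode("utf-8")   # U+00A0 encoded as UTF-8

print(raw_byte.hex(), is_valid_utf8(raw_byte))
print(encoded.hex(), is_valid_utf8(encoded))
```

Under that assumption the hexadecimal escape yields the single byte 0xA0,
which is not valid UTF-8 on its own, whereas \u00a0 yields the two bytes
0xC2 0xA0.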

> > +    % Return the number of unicode characters in a UTF-8 encoded string.
> 
> I suggest explicitly stating that the result is undefined (or
> unspecified or similar) if the given string isn't valid utf-8.  (The
> existing documentation gives the false impression that it counts only
> valid, complete unicode characters.)
> 

Okay, I'll bear that in mind when designing the unicode module.  utf8_length
won't be committed with this diff.
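
To make the concern concrete, here is a sketch of one common way to count
characters in UTF-8 data: count every byte that is not a continuation
byte.  This is an assumption for illustration only, written in Python; it
is not the utf8_length from the diff, and count_lead_bytes is a
hypothetical name.

```python
def count_lead_bytes(data: bytes) -> int:
    """Count bytes that are not UTF-8 continuation bytes (10xxxxxx).

    For valid UTF-8 this equals the number of code points.  For invalid
    input the result has no well-defined meaning.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


valid = "r\u00e9sum\u00e9".encode("utf-8")  # 6 characters, 8 bytes
truncated = valid[:-1]                        # last character cut in half
stray = bytes([0xA0])                         # lone continuation byte

print(count_lead_bytes(valid), count_lead_bytes(truncated),
      count_lead_bytes(stray))
```

The truncated input still reports 6 even though only 5 complete
characters are present, and the stray byte reports 0, which is why the
documentation should say the result is unspecified for invalid input.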

> > +++ tests/hard_coded/unicode.m	4 Jul 2006 10:01:15 -0000
> ...
> > +utf8_strings = [
> > +	"\u0003",
> > +	"\U00000003",
> > +	"\u0394",  % delta
> > +	"\u03A0",  % pi
> > +	"\uFFFF",
> > +	"\U0010FFFF",
> > +	"\U000ABCDE",
> > +	"r\u00E9sum\u00E9", % "resume" with accents
> > +	"abc123"
> > +].
> 
> It would be nice to add "\u005cu0041" as an example (0x5c = backslash),
> and similarly "\x5c\u0041", "\x5c\\u0041", "\\u0041" and "u0041".
> It would be good for some of these examples to use lowercase hex digits.
> 

I've added the extra tests.

I plan to post a new diff within the next week or so.

Ian.
--------------------------------------------------------------------------
mercury-reviews mailing list
post:  mercury-reviews at csse.unimelb.edu.au
administrative address: owner-mercury-reviews at csse.unimelb.edu.au
unsubscribe: Address: mercury-reviews-request at csse.unimelb.edu.au Message: unsubscribe
subscribe:   Address: mercury-reviews-request at csse.unimelb.edu.au Message: subscribe
--------------------------------------------------------------------------


