[m-rev.] preliminary review: UTF-8/16 support

Paul Bone pbone at csse.unimelb.edu.au
Mon Feb 21 13:02:18 AEDT 2011


On Wed, Feb 16, 2011 at 05:19:57PM +1100, Peter Wang wrote:
> On 2010-12-01, Peter Wang <novalazy at gmail.com> wrote:
> > Here is some work in progress on UTF-8 and UTF-16 string support, in
> > case someone would like to look at it or make suggestions.
> > 
> > We would commit to using the Universal Character Set instead of
> > trying to be agnostic about string encodings.
> > Mercury `char' becomes a Unicode code point, so is re-defined as a
> > 32-bit integer (at least).
> >    
> > String indexing predicates return code points instead of code units.
> > e.g. `string.index(String, CodeUnitOffset, CodePoint)' returns the
> > code point beginning at CodeUnitOffset.
> >    
> > When you iterate over a string you should not advance to Index + 1 any
> > more, so there are predicates which return the next/previous offset:
> >     string.index_next(String, Offset, CodePoint, NextOffset)
> >     string.prev_index(String, NextOffset, CodePoint, Offset)

We should probably add these to the API ASAP so that people can begin using
them and be ready for Unicode support once it's available.
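
To make the iteration pattern concrete, here is a rough analogue in Java,
whose strings are UTF-16 code unit sequences.  This is only a sketch and is not
taken from the patch: `IndexNextSketch', `indexNext' and `codePoints' are
hypothetical names, and the null return merely stands in for the failure of
the Mercury predicate when the offset is out of range.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexNextSketch {
    // Hypothetical analogue of string.index_next/4 (not the patch's code).
    // Returns {codePoint, nextOffset}, or null if offset is out of range.
    static int[] indexNext(String s, int offset) {
        if (offset < 0 || offset >= s.length()) {
            return null;
        }
        // codePointAt reads one or two code units starting at offset.
        int codePoint = s.codePointAt(offset);
        // charCount is 1 for BMP code points and 2 for supplementary ones.
        return new int[] { codePoint, offset + Character.charCount(codePoint) };
    }

    // Walk the string by following the returned offsets, never Index + 1.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        int offset = 0;
        int[] step;
        while ((step = indexNext(s, offset)) != null) {
            result.add(step[0]);
            offset = step[1];
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', U+1D538 (outside the BMP), 'b': four code units, three code points.
        String s = "a" + new String(Character.toChars(0x1D538)) + "b";
        System.out.println(s.length());
        System.out.println(codePoints(s));
    }
}
```

Code written this way does not care whether a code point occupies one code
unit or several, which is why it would keep working when the string encoding
changes underneath it.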

The overall intention of this patch looks good.

> Newer patch as requested by Julien.
> 
> Peter
> 
> index 6beb095..23d3884 100644
> --- a/compiler/elds_to_erlang.m
> +++ b/compiler/elds_to_erlang.m
> @@ -1178,7 +1178,7 @@ write_with_escaping(StringOrAtom, String, !IO) :-
>  write_with_escaping_2(StringOrAtom, Char, !IO) :-
>      char.to_int(Char, Int),
>      (
> -        32 =< Int, Int =< 126,
> +        32 =< Int, % Int =< 126,
>          Char \= ('\\'),
>          (
>              StringOrAtom = in_string,

Is this an intentional change?  Is the commenting-out here temporary?

> diff --git a/compiler/mlds_to_java.m b/compiler/mlds_to_java.m
> index 3004247..cf9db1c 100644
> --- a/compiler/mlds_to_java.m
> +++ b/compiler/mlds_to_java.m
> @@ -3449,7 +3449,8 @@ type_to_string(Info, MLDS_Type, String, ArrayDims) :-
>          ArrayDims = []
>      ;
>          MLDS_Type = mlds_native_char_type,
> -        String = "char",
> +        % Java `char' not large enough for code points so we must use `int'.
> +        String = "int",
>          ArrayDims = []
>      ;
>          MLDS_Type = mlds_foreign_type(ForeignType),

Really?  I thought Java used UTF-16 and therefore char should be wide enough.
Although if char is 16 bits wide then I agree, that's too small.  Perhaps I
don't understand how Java's multibyte character support works.
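
For reference, a small Java sketch of the width question.  It is an
illustration only, not part of the patch; `JavaCharWidth' and `unitsNeeded'
are hypothetical names.  Java's `char' is a single 16-bit UTF-16 code unit,
so a code point above U+FFFF takes two of them (a surrogate pair), and the
standard library itself uses `int' wherever it handles whole code points.

```java
final class JavaCharWidth {
    // Number of 16-bit `char' code units needed to hold one code point.
    static int unitsNeeded(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    // True if casting to `char' and back preserves the code point.
    static boolean survivesCharCast(int codePoint) {
        return (int) (char) codePoint == codePoint;
    }

    public static void main(String[] args) {
        int bmp = 'A';                 // U+0041, inside the BMP
        int supplementary = 0x1D538;   // outside the BMP

        System.out.println("bits in char: " + Character.SIZE);
        System.out.println("units for U+0041: " + unitsNeeded(bmp));
        System.out.println("units for U+1D538: " + unitsNeeded(supplementary));
        System.out.println("U+1D538 fits in char: "
            + survivesCharCast(supplementary));
    }
}
```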
