[m-rev.] for review: add some unicode support to Mercury

Peter Moulder Peter.Moulder at infotech.monash.edu.au
Fri Jul 21 17:24:44 AEST 2006


On Fri, Jul 21, 2006 at 11:01:50AM +1000, Ian MacLarty wrote:

> +++ NEWS	20 Jul 2006 08:21:32 -0000
...
> +  unicode character is implementation dependent.  For the Melbourne Mercury
> +  compiler unicode characters are represented using UTF-8 for the C backends.
             ^,

Missing comma after subordinate clause, making it harder to parse.

> +For the Melbourne Mercury compiler it is UTF-8 for the C backends and UTF-16
                                     ^,
> +for the Java and IL backends.

Same thing here (even though there's less parsing difficulty here).

> +            convert_unicode_char_to_target_chars(UnicodeCharCode, UTF8Chars)
> +        then
> +            get_quoted_name(QuoteChar, list.reverse(UTF8Chars) ++ Chars,

Minor stylistic issue: should probably s/UTF8/UTF/ above, and similarly
in string_get_unicode_escape.

I've assumed that most code (get_unicode_escape,
string_get_unicode_escape and encode_unicode_char_as_utf8) is otherwise
unchanged, and haven't checked them a second time.

> +allowed_unicode_char_code(Code) :-
> +    Code >= 0,
> +    % The following ranges are reserved for utf-16 surrogates.
> +    not (
> +        Code >= 0xD800, Code =< 0xDBFF
> +    ;
> +        Code >= 0xDC00, Code =< 0xDFFF
> +    ),

This is numerically just a single range, so I'd be inclined to write
that disjunction as a single range check,
‘not (Code >= 0xD800, Code =< 0xDFFF)’,
or push in that ‘not’ giving ‘( Code < 0xD800 ; 0xDFFF < Code )’.


[Interesting point about micro-optimization:
 the natural manual translation to C of the above `not ( ... ; ... )' (namely
 `!((0xD800 <= x && x <= 0xDBFF) || (0xDC00 <= x && x <= 0xDFFF))')
 produces good assembly code from gcc -O2 (tried both v3.3 and v4.1): a
 subtract and then a comparison, apparently identical to the code emitted
 for `x < 0xD800 || 0xDFFF < x'.

 Whereas `mmc --make tst.o && objdump -S Mercury/os/tst.o' (where mmc
 --output-grade-string gives hlc.gc and mgnuc does gcc-4.1 -O2 among
 other flags) gives a fairly literal translation of the Mercury code to
 assembler.

 This (together with my assumptions as to what assembly code has better
 runtime characteristics) shows that there is payoff in translating to
 natural-looking C code even just from runtime characteristics, let alone
 the benefits of readable code for debugging or language translation.

 If programmers aren't confident that the right machine code is going to
 be produced, then they spend time considering alternatives like
 ‘(Code /\ 0xF800) \= 0xD800’: optimization isn't just about faster &
 smaller executable code, it's about allowing easier-to-read code and
 less time spent thinking about trade-offs.]
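
For concreteness, here is a rough C sketch of the three forms discussed
above, with a brute-force check that they agree.  The function names are
mine (hypothetical), not from the patch.  One caveat I'm fairly sure of: the
mask form only matches the range forms for codes in 0..0xFFFF, so it would
need an extra guard if codes up to 0x10FFFF are passed in.

```c
#include <assert.h>
#include <stdbool.h>

/* Literal translation of the two-range disjunction quoted from the patch. */
bool not_surrogate_two_ranges(long x)
{
    return !((0xD800 <= x && x <= 0xDBFF) || (0xDC00 <= x && x <= 0xDFFF));
}

/* Single range, with the negation pushed in. */
bool not_surrogate_one_range(long x)
{
    return x < 0xD800 || 0xDFFF < x;
}

/*
 * Mask form, i.e. (Code /\ 0xF800) \= 0xD800.
 * Assumption: only meaningful for 0 <= x <= 0xFFFF; above that it
 * misclassifies codes such as 0x1D800.
 */
bool not_surrogate_mask(long x)
{
    return (x & 0xF800) != 0xD800;
}

/* Exhaustively compare the forms over the whole Unicode code range. */
void check_forms_agree(void)
{
    for (long x = 0; x <= 0x10FFFF; x++) {
        assert(not_surrogate_two_ranges(x) == not_surrogate_one_range(x));
        if (x <= 0xFFFF) {
            assert(not_surrogate_mask(x) == not_surrogate_one_range(x));
        }
    }
}
```

Calling check_forms_agree() should pass silently; 0x1D800 is an example
where the mask form and the range forms disagree.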


It's not clear to me how allowed_unicode_char_code is supposed to
behave.  It currently returns false for codes >0x10ffff, codes <0, and
surrogate codes, but returns true for 0xfffe, 0xffff, 0x10fffe and
0x10ffff, which are "guaranteed not to be a Unicode character at all".
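
If the stricter behaviour were wanted, a check along the following lines
could be used.  This is only a sketch in C, not a proposal for the actual
Mercury code: both function names are hypothetical, and it assumes the
Unicode definition of noncharacters (U+FDD0..U+FDEF plus the last two code
points of every plane), which is slightly wider than the four values listed
above.

```c
#include <stdbool.h>

/* Hypothetical helper: true for Unicode noncharacters. */
bool is_noncharacter(long code)
{
    if (code < 0 || code > 0x10FFFF) {
        return false;           /* outside the code space altogether */
    }
    if (code >= 0xFDD0 && code <= 0xFDEF) {
        return true;            /* the contiguous block of 32 noncharacters */
    }
    /* Last two code points of each plane: xxFFFE and xxFFFF. */
    return (code & 0xFFFE) == 0xFFFE;
}

/* Hypothetical stricter variant of the check under review. */
bool allowed_unicode_char_code_strict(long code)
{
    return code >= 0 && code <= 0x10FFFF
        && !(code >= 0xD800 && code <= 0xDFFF)   /* surrogates */
        && !is_noncharacter(code);
}
```

Whether \uFFFF and friends should actually be rejected is exactly the
question being asked; the sketch only shows that the stricter test is cheap
to write.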

The practical effect of different allowed_unicode_char_code behaviour is
which \u/\U escapes are valid.  By comparison, Java (which, I should say,
was written before UTF-16 existed) allows all 16-bit values, including
\uFFFF and \uD800.  I haven't yet looked at other languages/syntaxes
that allow \u or \U escapes.

> +:- pragma foreign_proc("Java",
> +    backend_unicode_encoding_int = (EncodingInt::out),
> +    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
> +"
> +    EncodingInt = 1;
> +").
> +
> +:- pragma foreign_proc("Java",
> +    backend_unicode_encoding_int = (EncodingInt::out),
> +    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
> +"
> +    EncodingInt = 1;
> +").

Copy & paste bug: the above two foreign_procs are identical (both are for
"Java" and both set EncodingInt = 1).