[m-rev.] for review: add some unicode support to Mercury

Ian MacLarty maclarty at csse.unimelb.edu.au
Thu Jul 27 11:09:06 AEST 2006


On Fri, Jul 21, 2006 at 05:24:44PM +1000, Peter Moulder wrote:
> On Fri, Jul 21, 2006 at 11:01:50AM +1000, Ian MacLarty wrote:
> 
> > +++ NEWS	20 Jul 2006 08:21:32 -0000
> ...
> > +  unicode character is implementation dependent.  For the Melbourne Mercury
> > +  compiler unicode characters are represented using UTF-8 for the C backends.
>              ^,
> 
> Missing comma after subordinate clause, making it harder to parse.
> 

Fixed.

> > +For the Melbourne Mercury compiler it is UTF-8 for the C backends and UTF-16
>                                      ^,
> > +for the Java and IL backends.
> 
> Same thing here (even though there's less parsing difficulty here).
> 

Fixed.

> > +            convert_unicode_char_to_target_chars(UnicodeCharCode, UTF8Chars)
> > +        then
> > +            get_quoted_name(QuoteChar, list.reverse(UTF8Chars) ++ Chars,
> 
> Minor stylistic issue: should probably s/UTF8/UTF/ above, and similarly
> in string_get_unicode_escape.
> 

Done.

> I've assumed that most code (get_unicode_escape,
> string_get_unicode_escape and encode_unicode_char_as_utf8) is otherwise
> unchanged, and haven't checked them a second time.
> 

Yes.

> > +allowed_unicode_char_code(Code) :-
> > +    Code >= 0,
> > +    % The following ranges are reserved for utf-16 surrogates.
> > +    not (
> > +        Code >= 0xD800, Code =< 0xDBFF
> > +    ;
> > +        Code >= 0xDC00, Code =< 0xDFFF
> > +    ),
> 
> This is numerically just a single range, so I'd be inclined to write
> that disjunction as a single range check,
> `not (Code >= 0xD800, Code =< 0xDFFF)',
> or push in that `not', giving `( Code < 0xD800 ; 0xDFFF < Code )'.
> 

So it is.  The documentation gives two ranges (one for high surrogates
and one for low surrogates).  I didn't notice they were one range.
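
The equivalence is easy to confirm exhaustively.  Here is a minimal C
sketch of such a check; the function names are made up for illustration
and are not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Two-range form, mirroring the disjunction in the original patch. */
bool outside_surrogates_two_ranges(long x)
{
    return !((0xD800 <= x && x <= 0xDBFF) || (0xDC00 <= x && x <= 0xDFFF));
}

/* Single-range form suggested in the review. */
bool outside_surrogates_one_range(long x)
{
    return x < 0xD800 || 0xDFFF < x;
}

/* Asserts that the two forms agree on every value in [lo, hi]. */
void check_forms_agree(long lo, long hi)
{
    for (long x = lo; x <= hi; x++) {
        assert(outside_surrogates_two_ranges(x) ==
               outside_surrogates_one_range(x));
    }
}
```

Calling check_forms_agree(-1, 0x110000) covers every value the predicate
can meaningfully see, plus one value either side of the legal range.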

> 
> [Interesting point about micro-optimization:
>  A manual, natural translation to C of the above `not ( ... ; ... )' (namely
>  `!((0xD800 <= x && x <= 0xDBFF) || (0xDC00 <= x && x <= 0xDFFF))')
>  produces good assembly code from gcc -O2 (tried both v3.3 and v4.1): a
>  subtract and then a comparison, apparently identical to the code emitted
>  for `x < 0xD800 || 0xDFFF < x'.
> 
>  Whereas `mmc --make tst.o && objdump -S Mercury/os/tst.o' (where mmc
>  --output-grade-string gives hlc.gc and mgnuc does gcc-4.1 -O2 among
>  other flags) gives a fairly literal translation of the Mercury code to
>  assembler.
> 
>  This (together with my assumptions as to what assembly code has better
>  runtime characteristics) shows that there is payoff in translating to
>  natural-looking C code even just from runtime characteristics, let alone
>  the benefits of readable code for debugging or language translation.
> 
>  If programmers aren't confident that the right machine code is going to
>  be produced, then they spend time considering alternatives like
>  `(Code /\ 0xF800) \= 0xD800': optimization isn't just about faster &
>  smaller executable code, it's about allowing easier-to-read code and
>  less time spent thinking about trade-offs.]
> 
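
One caveat on the mask form mentioned in the aside above: as far as I can
tell it matches the range check only for 16-bit values.  Above 0xFFFF the
mask ignores the high bits, so a code such as 0x1D800 would be treated as
a surrogate.  A small C sketch of the comparison (names are illustrative
only, and the C `&' stands in for Mercury's `/\'):

```c
#include <assert.h>
#include <stdbool.h>

/* Range form of "x is not a surrogate". */
bool not_surrogate_range(long x)
{
    return x < 0xD800 || 0xDFFF < x;
}

/* Mask form, a C rendering of (Code /\ 0xF800) \= 0xD800. */
bool not_surrogate_mask(long x)
{
    return (x & 0xF800) != 0xD800;
}

/* Returns the smallest x in [0, limit] on which the two forms differ,
   or -1 if they agree on the whole interval. */
long first_disagreement(long limit)
{
    assert(limit >= 0);
    for (long x = 0; x <= limit; x++) {
        if (not_surrogate_range(x) != not_surrogate_mask(x)) {
            return x;
        }
    }
    return -1;
}
```

With a limit of 0xFFFF the two forms agree everywhere; with a limit of
0x10FFFF the first disagreement is at 0x1D800.  So the mask form would
need an extra range test before it could replace the comparison.
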
> 
> It's not clear to me how allowed_unicode_char_code is supposed to
> behave.  It currently returns false for >0x10ffff, <0, surrogate pair
> codes, but returns true for 0xfffe, 0xffff, 0x10fffe, 0x10ffff, which
> are "guaranteed not to be a Unicode character at all".  
> 

It fails for code points that are illegal.  The code points you mention
above are either reserved for private use (i.e. an application can assign
any character it wants to those code points) or are "special" code points.
As far as I understand it, these are still legal code points.
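
To make the intended behaviour concrete, here is a rough C sketch of the
same test as the revised allowed_unicode_char_code.  The two extra
classifiers are not in the patch.  Their ranges are my reading of the
Unicode standard (noncharacters are U+FDD0..U+FDEF plus the last two code
points of each plane; private use is U+E000..U+F8FF plus planes 15 and 16
without their last two code points), so treat them as an assumption:

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the revised predicate: in 0..0x10FFFF and not a surrogate. */
bool allowed_code_point(long code)
{
    return code >= 0 && code <= 0x10FFFF &&
           !(code >= 0xD800 && code <= 0xDFFF);
}

/* Illustration only: noncharacters, per the ranges stated above. */
bool is_noncharacter(long code)
{
    assert(allowed_code_point(code));
    return (code >= 0xFDD0 && code <= 0xFDEF) ||
           (code & 0xFFFE) == 0xFFFE;
}

/* Illustration only: private-use code points, per the ranges above. */
bool is_private_use(long code)
{
    assert(allowed_code_point(code));
    return (code >= 0xE000 && code <= 0xF8FF) ||
           (code >= 0xF0000 && code <= 0xFFFFD) ||
           (code >= 0x100000 && code <= 0x10FFFD);
}
```

Under this sketch 0xFFFE, 0xFFFF, 0x10FFFE and 0x10FFFF are all allowed
and are classified as noncharacters rather than private use, while
0xD800, -1 and 0x110000 are rejected.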

> The practical effect of different allowed_unicode_char_code behaviour is
> what \u/\U escapes are valid.  By comparison, Java (which, I should say,
> was written before utf16 existed) allows all 16-bit values, including
> \uFFFF and \uD800.  I haven't yet looked at other languages/syntaxes
> that allow \u or \U escapes.

I don't think we should allow code points reserved for surrogates
because they are used to encode other code points (so if you reversed
the encoding you'd never get the surrogate code points back out again).
However, we should allow private-use code points, since users may have
assigned special meaning to them (for example, I believe Apple uses one
of the private-use code points for its logo).
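
The round-trip argument can be sketched in C as follows.  The encoder
uses the same bit fields as the revised encode_unicode_char_as_utf16.
The decoder and the function names are my own additions for illustration:

```c
#include <assert.h>

/* Encodes a supplementary code point (0x10000..0x10FFFF) as a surrogate
   pair, using the uuuuu / xxxxxx / low-ten-bits split from the patch. */
void utf16_encode_pair(long cp, unsigned *lead, unsigned *trail)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    unsigned uuuuu  = (unsigned)(cp >> 16) & 0x1F;
    unsigned xxxxxx = (unsigned)(cp >> 10) & 0x3F;
    unsigned low10  = (unsigned)cp & 0x3FF;
    unsigned wwww   = uuuuu - 1;
    *lead  = 0xD800u | (wwww << 6) | xxxxxx;
    *trail = 0xDC00u | low10;
}

/* Decodes a surrogate pair back to a code point. */
long utf16_decode_pair(unsigned lead, unsigned trail)
{
    assert(lead >= 0xD800 && lead <= 0xDBFF);
    assert(trail >= 0xDC00 && trail <= 0xDFFF);
    return 0x10000L + ((long)(lead - 0xD800) << 10) + (long)(trail - 0xDC00);
}

/* Asserts that every supplementary code point survives a round trip. */
void check_round_trip(void)
{
    for (long cp = 0x10000; cp <= 0x10FFFF; cp++) {
        unsigned lead, trail;
        utf16_encode_pair(cp, &lead, &trail);
        assert(utf16_decode_pair(lead, trail) == cp);
    }
}
```

The smallest value the decoder can produce is 0x10000, so no pair ever
decodes to a code point in 0xD800..0xDFFF, which is why those values
cannot be given a meaning of their own.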

> 
> > +:- pragma foreign_proc("Java",
> > +    backend_unicode_encoding_int = (EncodingInt::out),
> > +    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
> > +"
> > +    EncodingInt = 1;
> > +").
> > +
> > +:- pragma foreign_proc("Java",
> > +    backend_unicode_encoding_int = (EncodingInt::out),
> > +    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
> > +"
> > +    EncodingInt = 1;
> > +").
> 
> Copy & paste bug: the above two are identical.

Fixed.

Here's the interdiff:

diff -u NEWS NEWS
--- NEWS	20 Jul 2006 08:21:32 -0000
+++ NEWS	27 Jul 2006 00:45:04 -0000
@@ -299,7 +299,7 @@
   (or XXXXXXXX) is a Unicode character code in hexidecimal, is replaced with
   the corresponding Unicode character.  The encoding used to represent the
   Unicode character is implementation dependent.  For the Melbourne Mercury
-  compiler Unicode characters are represented using UTF-8 for the C backends.
+  compiler, Unicode characters are represented using UTF-8 for the C backends.
 
 Changes to the Mercury standard library:
 
diff -u doc/reference_manual.texi doc/reference_manual.texi
--- doc/reference_manual.texi	20 Jul 2006 07:58:33 -0000
+++ doc/reference_manual.texi	27 Jul 2006 00:44:16 -0000
@@ -246,7 +246,7 @@
 @samp{\U} must be followed by the Unicode character code expressed as eight
 hexidecimal digits.
 The encoding used for Unicode characters is implementation dependent.
-For the Melbourne Mercury compiler it is UTF-8 for the C backends and UTF-16
+For the Melbourne Mercury compiler, it is UTF-8 for the C backends and UTF-16
 for the Java and IL backends.
 
 A backslash followed immediately by a newline is deleted; thus an
diff -u library/lexer.m library/lexer.m
--- library/lexer.m	20 Jul 2006 07:13:33 -0000
+++ library/lexer.m	27 Jul 2006 00:47:56 -0000
@@ -989,9 +989,9 @@
         rev_char_list_to_string(HexChars, HexString),
         ( if
             string.base_string_to_int(16, HexString, UnicodeCharCode),
-            convert_unicode_char_to_target_chars(UnicodeCharCode, UTF8Chars)
+            convert_unicode_char_to_target_chars(UnicodeCharCode, UTFChars)
         then
-            get_quoted_name(QuoteChar, list.reverse(UTF8Chars) ++ Chars,
+            get_quoted_name(QuoteChar, list.reverse(UTFChars) ++ Chars,
                 Token, !IO)
         else
             Token = error("invalid unicode character code")
@@ -1025,9 +1025,9 @@
         rev_char_list_to_string(HexChars, HexString),
         ( if
             string.base_string_to_int(16, HexString, UnicodeCharCode),
-            convert_unicode_char_to_target_chars(UnicodeCharCode, UTF8Chars)
+            convert_unicode_char_to_target_chars(UnicodeCharCode, UTFChars)
         then
-            RevCharsWithUnicode = list.reverse(UTF8Chars) ++ Chars,
+            RevCharsWithUnicode = list.reverse(UTFChars) ++ Chars,
             string_get_quoted_name(String, Len, QuoteChar, RevCharsWithUnicode,
                 Posn0, Token, Context, !Posn)
         else
@@ -1066,6 +1066,11 @@
 
 encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars) :-
     allowed_unicode_char_code(UnicodeCharCode),
+    %
+    % Refer to table 3-5 of the Unicode 4.0.0 standard (available from
+    % www.unicode.org) for documentation on the bit distribution patterns used
+    % below.
+    %
     ( if UnicodeCharCode =< 0x00007F then
         UTF8Chars = [char.det_from_int(UnicodeCharCode)]
     else if UnicodeCharCode =< 0x0007FF then
@@ -1101,29 +1106,45 @@
     %
 encode_unicode_char_as_utf16(UnicodeCharCode, UTF16Chars) :-
     allowed_unicode_char_code(UnicodeCharCode),
+    %
+    % If the code point is less than or equal to 0xFFFF
+    % then the UTF-16 encoding is simply the code point value,
+    % otherwise we construct a surrogate pair.
+    %
     ( if UnicodeCharCode =< 0xFFFF then
         char.det_from_int(UnicodeCharCode, Char),
         UTF16Chars = [Char]
     else
-        LeadOffset = 0xDB00 - (0x10000 >> 10),
-        Lead = LeadOffset + (UnicodeCharCode >> 10),
-        Trail = 0xDC00 + (UnicodeCharCode /\ 0x3FF),
-        char.det_from_int(Lead, LeadChar),
-        char.det_from_int(Trail, TrailChar),
-        UTF16Chars = [LeadChar, TrailChar]
+        %
+        % Refer to table 3-4 of the Unicode 4.0.0 standard (available from
+        % www.unicode.org) for documentation on the bit distribution patterns
+        % used below.
+        %
+        UUUUU =      (0b111110000000000000000 /\ UnicodeCharCode) >> 16,
+        XXXXXX =     (0b000001111110000000000 /\ UnicodeCharCode) >> 10,
+        XXXXXXXXXX = (0b000000000001111111111 /\ UnicodeCharCode),
+        WWWWW = UUUUU - 1,
+        Surrogate1Lead = 0b1101100000000000,
+        Surrogate2Lead = 0b1101110000000000,
+        Surrogate1 = Surrogate1Lead \/ (WWWWW << 6) \/ XXXXXX,
+        Surrogate2 = Surrogate2Lead \/ XXXXXXXXXX,
+        char.det_from_int(Surrogate1, Surrogate1Char),
+        char.det_from_int(Surrogate2, Surrogate2Char),
+        UTF16Chars = [Surrogate1Char, Surrogate2Char]
     ).
 
 :- pred allowed_unicode_char_code(int::in) is semidet.
 
+    % Succeeds if the given code point is a legal Unicode code point
+    % (regardless of whether it is reserved for private use or not).
+    %
 allowed_unicode_char_code(Code) :-
     Code >= 0,
-    % The following ranges are reserved for utf-16 surrogates.
+    Code =< 0x10FFFF,
+    % The following range is reserved for surrogates.
     not (
-        Code >= 0xD800, Code =< 0xDBFF
-    ;
-        Code >= 0xDC00, Code =< 0xDFFF
-    ),
-    Code =< 0x10FFFF.
+        Code >= 0xD800, Code =< 0xDFFF
+    ).
 
 :- type unicode_encoding
     --->    utf8
@@ -1163,7 +1184,7 @@
     EncodingInt = 1;
 ").
 
-:- pragma foreign_proc("Java",
+:- pragma foreign_proc("C#",
     backend_unicode_encoding_int = (EncodingInt::out),
     [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
 "
diff -u tests/hard_coded/Mmakefile tests/hard_coded/Mmakefile
--- tests/hard_coded/Mmakefile	20 Jul 2006 07:58:41 -0000
+++ tests/hard_coded/Mmakefile	27 Jul 2006 00:42:59 -0000
@@ -214,7 +214,7 @@
 	type_spec_ho_term \
 	type_spec_modes \
 	type_to_term_bug \
-	unicode \
+	unicode_test \
 	unify_existq_cons \
 	unify_expression \
 	unify_typeinfo_bug \
reverted:
--- tests/hard_coded/unicode.exp	21 Jul 2006 00:50:43 -0000
+++ /dev/null	1 Jan 1970 00:00:00 -0000
@@ -1,29 +0,0 @@
-00000011
-00000011
-11001110 10010100
-11001110 10100000
-11101111 10111111 10111111
-11110100 10001111 10111111 10111111
-11110010 10101011 10110011 10011110
-01110010 11000011 10101001 01110011 01110101 01101101 11000011 10101001
-01100001 01100010 01100011 00110001 00110010 00110011
-01011100 01110101 00110000 00110000 00110100 00110001
-01011100 01110101 00110000 00110000 00110100 00110001
-01011100 01000001
-01011100 01110101 00110000 00110000 00110100 00110001
-01110101 00110000 00110000 00110100 00110001
-
-
-
-Δ
-Π
-￿
-􏿿
-򫳞
-résumé
-abc123
-\u0041
-\u0041
-\A
-\u0041
-u0041
reverted:
--- tests/hard_coded/unicode.m	20 Jul 2006 07:56:50 -0000
+++ /dev/null	1 Jan 1970 00:00:00 -0000
@@ -1,50 +0,0 @@
-:- module unicode.
-
-:- interface.
-
-:- import_module io.
-
-:- pred main(io::di, io::uo) is det.
-
-:- implementation.
-
-:- import_module char.
-:- import_module list.
-:- import_module string.
-
-main(!IO) :-
-	list.foldl(write_string_as_binary, utf8_strings, !IO),
-	io.nl(!IO),
-	Str = string.join_list("\n", utf8_strings),
-	io.write_string(Str, !IO),
-	io.nl(!IO).
-
-:- func utf8_strings = list(string).
-
-utf8_strings = [
-	"\u0003",
-	"\U00000003",
-	"\u0394",  % delta
-	"\u03A0",  % pi
-	"\uFFFF",
-	"\U0010ffff",
-	"\U000ABCde",
-	"r\u00E9sum\u00E9", % "resume" with accents
-	"abc123",
-	"\u005cu0041",
-	"\x5c\u0041",
-	"\x5c\\u0041",
-	"\\u0041",
-	"u0041"
-].
-
-:- pred write_string_as_binary(string::in, io::di, io::uo) is det.
-
-write_string_as_binary(Str, !IO) :-
-	Chars = string.to_char_list(Str),
-	Ints = list.map(to_int, Chars),
-	Bins = list.map(( func(X) = int_to_base_string(X, 2) ), Ints),
-	PaddedBins = list.map(( func(S) = pad_left(S, '0', 8) ), Bins),
-	Bin = string.join_list(" ", PaddedBins),
-	io.write_string(Bin, !IO),
-	io.nl(!IO).
only in patch2:
unchanged:
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode_test.exp	21 Jul 2006 05:28:25 -0000
@@ -0,0 +1,29 @@
+00000011
+00000011
+11001110 10010100
+11001110 10100000
+11101111 10111111 10111111
+11110100 10001111 10111111 10111111
+11110010 10101011 10110011 10011110
+01110010 11000011 10101001 01110011 01110101 01101101 11000011 10101001
+01100001 01100010 01100011 00110001 00110010 00110011
+01011100 01110101 00110000 00110000 00110100 00110001
+01011100 01110101 00110000 00110000 00110100 00110001
+01011100 01000001
+01011100 01110101 00110000 00110000 00110100 00110001
+01110101 00110000 00110000 00110100 00110001
+
+
+
+Δ
+Π
+￿
+􏿿
+򫳞
+résumé
+abc123
+\u0041
+\u0041
+\A
+\u0041
+u0041
only in patch2:
unchanged:
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode_test.m	27 Jul 2006 00:43:19 -0000
@@ -0,0 +1,50 @@
+:- module unicode_test.
+
+:- interface.
+
+:- import_module io.
+
+:- pred main(io::di, io::uo) is det.
+
+:- implementation.
+
+:- import_module char.
+:- import_module list.
+:- import_module string.
+
+main(!IO) :-
+	list.foldl(write_string_as_binary, utf8_strings, !IO),
+	io.nl(!IO),
+	Str = string.join_list("\n", utf8_strings),
+	io.write_string(Str, !IO),
+	io.nl(!IO).
+
+:- func utf8_strings = list(string).
+
+utf8_strings = [
+	"\u0003",
+	"\U00000003",
+	"\u0394",  % delta
+	"\u03A0",  % pi
+	"\uFFFF",
+	"\U0010ffff",
+	"\U000ABCde",
+	"r\u00E9sum\u00E9", % "resume" with accents
+	"abc123",
+	"\u005cu0041",
+	"\x5c\u0041",
+	"\x5c\\u0041",
+	"\\u0041",
+	"u0041"
+].
+
+:- pred write_string_as_binary(string::in, io::di, io::uo) is det.
+
+write_string_as_binary(Str, !IO) :-
+	Chars = string.to_char_list(Str),
+	Ints = list.map(to_int, Chars),
+	Bins = list.map(( func(X) = int_to_base_string(X, 2) ), Ints),
+	PaddedBins = list.map(( func(S) = pad_left(S, '0', 8) ), Bins),
+	Bin = string.join_list(" ", PaddedBins),
+	io.write_string(Bin, !IO),
+	io.nl(!IO).
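
The byte patterns in unicode_test.exp can be cross-checked independently
of the Mercury lexer.  Below is a minimal C sketch of a UTF-8 encoder
following the usual bit distribution.  It is not the library code, and the
function names are invented for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes the UTF-8 encoding of cp into out and returns its length in
   bytes, or 0 if cp is a surrogate or lies outside 0..0x10FFFF. */
size_t utf8_encode(long cp, unsigned char out[4])
{
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Returns 1 if cp encodes to exactly the n bytes in expected, else 0. */
int utf8_matches(long cp, const unsigned char *expected, size_t n)
{
    unsigned char buf[4];
    assert(n <= 4);
    size_t len = utf8_encode(cp, buf);
    return len == n && memcmp(buf, expected, n) == 0;
}
```

For instance, 0x0394 should match the bytes CE 94 and 0xABCDE the bytes
F2 AB B3 9E, which correspond to the third and seventh lines of the
expected output.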



