[m-rev.] for review: add some unicode support to Mercury

Ian MacLarty maclarty at csse.unimelb.edu.au
Fri Jul 21 11:01:50 AEST 2006


On Wed, Jul 19, 2006 at 04:01:57PM +1000, Peter Moulder wrote:
> On Wed, Jul 05, 2006 at 12:24:25AM +1000, Ian MacLarty wrote:
> 
> > +The sequence @samp{\x} introduces
> >  a hexadecimal escape; it must be followed by a sequence of hexadecimal
> >  digits and then a closing backslash.  It is replaced
> >  with the character whose character code is identified by the hexadecimal
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> I suggest changing to `byte whose value', to clarify that e.g. \xa0\ is
> replaced by just one byte rather than being equivalent to \u00a0.
> 
> > +    % Return the number of unicode characters in a UTF-8 encoded string.
> 
> I suggest explicitly stating that the result is undefined (or
> unspecified or similar) if the given string isn't valid utf-8.  (The
> existing documentation gives the false impression that it counts only
> valid, complete unicode characters.)
> 
> > +++ tests/hard_coded/unicode.m	4 Jul 2006 10:01:15 -0000
> ...
> > +utf8_strings = [
> > +	"\u0003",
> > +	"\U00000003",
> > +	"\u0394",  % delta
> > +	"\u03A0",  % pi
> > +	"\uFFFF",
> > +	"\U0010FFFF",
> > +	"\U000ABCDE",
> > +	"r\u00E9sum\u00E9", % "resume" with accents
> > +	"abc123"
> > +].
> 
> It would be nice to add "\u005cu0041" as an example (0x5c = backslash),
> and similarly "\x5c\u0041", "\x5c\\u0041", "\\u0041" and "u0041".
> It would be good for some of these examples to use lowercase hex digits.
> 
> Otherwise looks fine to me.
> 

Here's the new diff and CVS log (the interdiff is almost as big as the
diff, so I'm just posting the diff).

I'll post the new unicode module as a separate change.

Estimated hours taken: 6
Branches: main

Add escape sequences for encoding unicode characters in Mercury string
literals.  The unicode characters are encoded in UTF-8 on the C backends and
in UTF-16 on the Java and IL backends.

NEWS:
	Mention the changes.

doc/reference_manual.texi:
	Document the new escape sequences.

	Divide the section on string tokens up into multiple paragraphs.

library/lexer.m:
	Convert unicode characters encoded with the new escape sequences to 
	UTF-8 on the C backends and UTF-16 on the Java and IL backends.

	Some of the new predicate may be moved to a "unicode" module that I'm
	intending to add to the standard library in a subsequent change.

library/list.m:
	Fix the formatting of some comments.

tests/hard_coded/Mmakefile:
tests/hard_coded/unicode.exp:
tests/hard_coded/unicode.m:
tests/invalid/Mmakefile:
tests/invalid/unicode1.exp:
tests/invalid/unicode1.m:
tests/invalid/unicode2.exp:
tests/invalid/unicode2.m:
	Test the new support for unicode.

Index: NEWS
===================================================================
RCS file: /home/mercury1/repository/mercury/NEWS,v
retrieving revision 1.415
diff -u -r1.415 NEWS
--- NEWS	20 Jul 2006 01:50:41 -0000	1.415
+++ NEWS	20 Jul 2006 08:21:32 -0000
@@ -32,6 +32,8 @@
 * ':' is now the type qualification operator, not a module qualifier.
 * To ensure soundness, goals in negated contexts using non-local variables
   with dynamic modes (inst "any") must now be marked as impure.
+* Unicode characters can now be encoded in string literals using an
+  escape sequence.
 
 Changes to the Mercury standard library:
 * We have removed the predicates dealing with runtime type information (RTTI)
@@ -292,6 +294,13 @@
   sure that the goal is in fact pure, e.g. because they know that
   the goal inside the negation will not instantiate the variable.
 
+* Unicode characters can now be encoded in string literals using an
+  escape sequence.  The escape sequence \uXXXX (or \UXXXXXXXX), where XXXX
+  (or XXXXXXXX) is a unicode character code in hexidecimal, is replaced with
+  the corresponding unicode character.  The encoding used to represent the
+  unicode character is implementation dependent.  For the Melbourne Mercury
+  compiler unicode characters are represented using UTF-8 for the C backends.
+
 Changes to the Mercury standard library:
 
 * We have added the function `divide_equivalence_classes' to the `eqvclass'
Index: doc/reference_manual.texi
===================================================================
RCS file: /home/mercury1/repository/mercury/doc/reference_manual.texi,v
retrieving revision 1.361
diff -u -r1.361 reference_manual.texi
--- doc/reference_manual.texi	20 Jul 2006 02:44:04 -0000	1.361
+++ doc/reference_manual.texi	20 Jul 2006 07:58:33 -0000
@@ -220,21 +220,35 @@
 
 @item string
 A string is a sequence of characters enclosed in double quotes (@code{"}).
+
 Within a string, two adjacent double quotes stand for a single double quote.
 For example, the string @samp{ """" } is a string of length one, containing
 a single double quote: the outermost pair of double quotes encloses the
 string, and the innermost pair stand for a single double quote.
+
 Strings may also contain backslash escapes.  @samp{\a} stands for ``alert''
 (a beep character), @samp{\b} for backspace, @samp{\r} for carriage-return,
 @samp{\f} for form-feed, @samp{\t} for tab, @samp{\n} for newline,
 @samp{\v} for vertical-tab.  An escaped backslash, single-quote, or
-double-quote stands for itself.  The sequence @samp{\x} introduces
+double-quote stands for itself.
+
+The sequence @samp{\x} introduces
 a hexadecimal escape; it must be followed by a sequence of hexadecimal
 digits and then a closing backslash.  It is replaced
 with the character whose character code is identified by the hexadecimal
 number.  Similarly, a backslash followed by an octal digit is the
 beginning of an octal escape; as with hexadecimal escapes, the sequence
 of octal digits must be terminated with a closing backslash.
+
+The sequence @samp{\u} or @samp{\U} can be used to escape unicode characters.
+ at samp{\u} must be followed by the unicode character code expressed as four
+hexidecimal digits.
+ at samp{\U} must be followed by the unicode character code expressed as eight
+hexidecimal digits.
+The encoding used for unicode characters is implementation dependent.
+For the Melbourne Mercury compiler it is UTF-8 for the C backends and UTF-16
+for the Java and IL backends.
+
 A backslash followed immediately by a newline is deleted; thus an
 escaped newline can be used to continue a string over more than one
 source line.  (String literals may also contain embedded newlines.)
Index: library/lexer.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/lexer.m,v
retrieving revision 1.46
diff -u -r1.46 lexer.m
--- library/lexer.m	19 Apr 2006 05:17:53 -0000	1.46
+++ library/lexer.m	20 Jul 2006 07:13:33 -0000
@@ -895,6 +895,10 @@
             get_quoted_name(QuoteChar, Chars1, Token, !IO)
         ; Char = 'x' ->
             get_hex_escape(QuoteChar, Chars, [], Token, !IO)
+        ; Char = 'u' ->
+            get_unicode_escape(4, QuoteChar, Chars, [], Token, !IO)
+        ; Char = 'U' ->
+            get_unicode_escape(8, QuoteChar, Chars, [], Token, !IO)
         ; char.is_octal_digit(Char) ->
             get_octal_escape(QuoteChar, Chars, [Char], Token, !IO)
         ;
@@ -923,6 +927,12 @@
         ; Char = 'x' ->
             string_get_hex_escape(String, Len, QuoteChar, Chars, [],
                 Posn0, Token, Context, !Posn)
+        ; Char = 'u' ->
+            string_get_unicode_escape(4, String, Len, QuoteChar, Chars,
+                    [], Posn0, Token, Context, !Posn)
+        ; Char = 'U' ->
+            string_get_unicode_escape(8, String, Len, QuoteChar, Chars,
+                    [], Posn0, Token, Context, !Posn)
         ; char.is_octal_digit(Char) ->
             string_get_octal_escape(String, Len, QuoteChar, Chars, [Char],
                 Posn0, Token, Context, !Posn)
@@ -971,6 +981,195 @@
         )
     ).
 
+:- pred get_unicode_escape(int::in, char::in, list(char)::in, list(char)::in,
+    token::out, io::di, io::uo) is det.
+
+get_unicode_escape(NumHexChars, QuoteChar, Chars, HexChars, Token, !IO) :-
+    ( if NumHexChars = list.length(HexChars) then
+        rev_char_list_to_string(HexChars, HexString),
+        ( if
+            string.base_string_to_int(16, HexString, UnicodeCharCode),
+            convert_unicode_char_to_target_chars(UnicodeCharCode, UTF8Chars)
+        then
+            get_quoted_name(QuoteChar, list.reverse(UTF8Chars) ++ Chars,
+                Token, !IO)
+        else
+            Token = error("invalid unicode character code")
+        )
+    else
+        io.read_char(Result, !IO),
+        (
+            Result = error(Error),
+            Token = io_error(Error)
+        ;
+            Result = eof,
+            Token = error("unterminated quote")
+        ;
+            Result = ok(Char),
+            ( if char.is_hex_digit(Char) then
+                get_unicode_escape(NumHexChars, QuoteChar, Chars,
+                    [Char | HexChars], Token, !IO)
+            else
+                Token = error("invalid hex character in unicode escape")
+            )
+        )
+    ).
+
+:- pred string_get_unicode_escape(int::in, string::in, int::in, char::in,
+    list(char)::in, list(char)::in, posn::in, token::out,
+    string_token_context::out, posn::in, posn::out) is det.
+
+string_get_unicode_escape(NumHexChars, String, Len, QuoteChar, Chars,
+        HexChars, Posn0, Token, Context, !Posn) :-
+    ( if NumHexChars = list.length(HexChars) then
+        rev_char_list_to_string(HexChars, HexString),
+        ( if
+            string.base_string_to_int(16, HexString, UnicodeCharCode),
+            convert_unicode_char_to_target_chars(UnicodeCharCode, UTF8Chars)
+        then
+            RevCharsWithUnicode = list.reverse(UTF8Chars) ++ Chars,
+            string_get_quoted_name(String, Len, QuoteChar, RevCharsWithUnicode,
+                Posn0, Token, Context, !Posn)
+        else
+            string_get_context(Posn0, Context, !Posn),
+            Token = error("invalid unicode character code")
+        )
+    else
+        ( if string_read_char(String, Len, Char, !Posn) then
+            ( if char.is_hex_digit(Char) then
+                string_get_unicode_escape(NumHexChars, String, Len, QuoteChar,
+                    Chars, [Char | HexChars], Posn0, Token, Context, !Posn)
+            else
+                string_get_context(Posn0, Context, !Posn),
+                Token = error("invalid hex character in unicode escape")
+            )
+        else
+            string_get_context(Posn0, Context, !Posn),
+            Token = error("unterminated quote")
+        )
+    ).
+
+:- pred convert_unicode_char_to_target_chars(int::in, list(char)::out)
+    is semidet.
+
+convert_unicode_char_to_target_chars(UnicodeCharCode, Chars) :-
+    BackendEncoding = backend_unicode_encoding,
+    (
+        BackendEncoding = utf8,
+        encode_unicode_char_as_utf8(UnicodeCharCode, Chars)
+    ;
+        BackendEncoding = utf16,
+        encode_unicode_char_as_utf16(UnicodeCharCode, Chars)
+    ).
+
+:- pred encode_unicode_char_as_utf8(int::in, list(char)::out) is semidet.
+
+encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars) :-
+    allowed_unicode_char_code(UnicodeCharCode),
+    ( if UnicodeCharCode =< 0x00007F then
+        UTF8Chars = [char.det_from_int(UnicodeCharCode)]
+    else if UnicodeCharCode =< 0x0007FF then
+        Part1 = (0b11111000000 /\ UnicodeCharCode) >> 6,
+        Part2 =  0b00000111111 /\ UnicodeCharCode,
+        char.det_from_int(Part1 \/ 0b11000000, UTF8Char1),
+        char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+        UTF8Chars = [UTF8Char1, UTF8Char2]
+    else if UnicodeCharCode =< 0x00FFFF then
+        Part1 = (0b1111000000000000 /\ UnicodeCharCode) >> 12,
+        Part2 = (0b0000111111000000 /\ UnicodeCharCode) >> 6,
+        Part3 =  0b0000000000111111 /\ UnicodeCharCode,
+        char.det_from_int(Part1 \/ 0b11100000, UTF8Char1),
+        char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+        char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
+        UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3]
+    else
+        Part1 = (0b111000000000000000000 /\ UnicodeCharCode) >> 18,
+        Part2 = (0b000111111000000000000 /\ UnicodeCharCode) >> 12,
+        Part3 = (0b000000000111111000000 /\ UnicodeCharCode) >> 6,
+        Part4 =  0b000000000000000111111 /\ UnicodeCharCode,
+        char.det_from_int(Part1 \/ 0b11110000, UTF8Char1),
+        char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+        char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
+        char.det_from_int(Part4 \/ 0b10000000, UTF8Char4),
+        UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3, UTF8Char4]
+    ).
+
+:- pred encode_unicode_char_as_utf16(int::in, list(char)::out) is semidet.
+
+    % This predicate should only be called on backends that have
+    % a 16 bit character type.
+    %
+encode_unicode_char_as_utf16(UnicodeCharCode, UTF16Chars) :-
+    allowed_unicode_char_code(UnicodeCharCode),
+    ( if UnicodeCharCode =< 0xFFFF then
+        char.det_from_int(UnicodeCharCode, Char),
+        UTF16Chars = [Char]
+    else
+        LeadOffset = 0xDB00 - (0x10000 >> 10),
+        Lead = LeadOffset + (UnicodeCharCode >> 10),
+        Trail = 0xDC00 + (UnicodeCharCode /\ 0x3FF),
+        char.det_from_int(Lead, LeadChar),
+        char.det_from_int(Trail, TrailChar),
+        UTF16Chars = [LeadChar, TrailChar]
+    ).
+
+:- pred allowed_unicode_char_code(int::in) is semidet.
+
+allowed_unicode_char_code(Code) :-
+    Code >= 0,
+    % The following ranges are reserved for utf-16 surrogates.
+    not (
+        Code >= 0xD800, Code =< 0xDBFF
+    ;
+        Code >= 0xDC00, Code =< 0xDFFF
+    ),
+    Code =< 0x10FFFF.
+
+:- type unicode_encoding
+    --->    utf8
+    ;       utf16.
+
+:- func backend_unicode_encoding = unicode_encoding.
+
+backend_unicode_encoding = Encoding :-
+    Int = backend_unicode_encoding_int,
+    ( unicode_encoding_int_to_encoding(Int, Encoding0) ->
+        Encoding = Encoding0
+    ;
+        error("backend_unicode_encoding: unexpected unicode encoding code")
+    ).
+
+:- pred unicode_encoding_int_to_encoding(int::in, unicode_encoding::out)
+    is semidet.
+
+unicode_encoding_int_to_encoding(0, utf8).
+unicode_encoding_int_to_encoding(1, utf16).
+
+:- func backend_unicode_encoding_int = int.
+
+:- pragma inline(backend_unicode_encoding_int/0).
+
+:- pragma foreign_proc("C",
+    backend_unicode_encoding_int = (EncodingInt::out),
+    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
+"
+    EncodingInt = 0;
+").
+
+:- pragma foreign_proc("Java",
+    backend_unicode_encoding_int = (EncodingInt::out),
+    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
+"
+    EncodingInt = 1;
+").
+
+:- pragma foreign_proc("Java",
+    backend_unicode_encoding_int = (EncodingInt::out),
+    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
+"
+    EncodingInt = 1;
+").
+
 :- pred string_get_hex_escape(string::in, int::in, char::in,
     list(char)::in, list(char)::in, posn::in, token::out,
     string_token_context::out, posn::in, posn::out) is det.
Index: library/list.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/list.m,v
retrieving revision 1.150
diff -u -r1.150 list.m
--- library/list.m	15 May 2006 09:12:05 -0000	1.150
+++ library/list.m	4 Jul 2006 13:35:59 -0000
@@ -1021,14 +1021,17 @@
     % Same as list.filter_map/3 except that it only returns the first
     % match:
     %   find_first_map(X, Y, Z) <=> list.filter_map(X, Y, [Z | _])
+    %
 :- pred list.find_first_map(pred(X, Y)::in(pred(in, out) is semidet),
         list(X)::in, Y::out) is semidet.
 
     % Same as list.find_first_map, except with two outputs.
+    %
 :- pred list.find_first_map2(pred(X, A, B)::in(pred(in, out, out) is semidet),
     list(X)::in, A::out, B::out) is semidet.
 
     % Same as list.find_first_map, except with three outputs.
+    %
 :- pred list.find_first_map3(
     pred(X, A, B, C)::in(pred(in, out, out, out) is semidet),
     list(X)::in, A::out, B::out, C::out) is semidet.
Index: tests/hard_coded/Mmakefile
===================================================================
RCS file: /home/mercury1/repository/tests/hard_coded/Mmakefile,v
retrieving revision 1.289
diff -u -r1.289 Mmakefile
--- tests/hard_coded/Mmakefile	19 Jul 2006 15:19:07 -0000	1.289
+++ tests/hard_coded/Mmakefile	20 Jul 2006 07:58:41 -0000
@@ -214,6 +214,7 @@
 	type_spec_ho_term \
 	type_spec_modes \
 	type_to_term_bug \
+	unicode \
 	unify_existq_cons \
 	unify_expression \
 	unify_typeinfo_bug \
Index: tests/hard_coded/unicode.exp
===================================================================
RCS file: tests/hard_coded/unicode.exp
diff -N tests/hard_coded/unicode.exp
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode.exp	21 Jul 2006 00:50:43 -0000
@@ -0,0 +1,29 @@
+00000011
+00000011
+11001110 10010100
+11001110 10100000
+11101111 10111111 10111111
+11110100 10001111 10111111 10111111
+11110010 10101011 10110011 10011110
+01110010 11000011 10101001 01110011 01110101 01101101 11000011 10101001
+01100001 01100010 01100011 00110001 00110010 00110011
+01011100 01110101 00110000 00110000 00110100 00110001
+01011100 01110101 00110000 00110000 00110100 00110001
+01011100 01000001
+01011100 01110101 00110000 00110000 00110100 00110001
+01110101 00110000 00110000 00110100 00110001
+
+
+
+Δ
+Π
+￿
+􏿿
+򫳞
+résumé
+abc123
+\u0041
+\u0041
+\A
+\u0041
+u0041
Index: tests/hard_coded/unicode.m
===================================================================
RCS file: tests/hard_coded/unicode.m
diff -N tests/hard_coded/unicode.m
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode.m	20 Jul 2006 07:56:50 -0000
@@ -0,0 +1,50 @@
+:- module unicode.
+
+:- interface.
+
+:- import_module io.
+
+:- pred main(io::di, io::uo) is det.
+
+:- implementation.
+
+:- import_module char.
+:- import_module list.
+:- import_module string.
+
+main(!IO) :-
+	list.foldl(write_string_as_binary, utf8_strings, !IO),
+	io.nl(!IO),
+	Str = string.join_list("\n", utf8_strings),
+	io.write_string(Str, !IO),
+	io.nl(!IO).
+
+:- func utf8_strings = list(string).
+
+utf8_strings = [
+	"\u0003",
+	"\U00000003",
+	"\u0394",  % delta
+	"\u03A0",  % pi
+	"\uFFFF",
+	"\U0010ffff",
+	"\U000ABCde",
+	"r\u00E9sum\u00E9", % "resume" with accents
+	"abc123",
+	"\u005cu0041",
+	"\x5c\u0041",
+	"\x5c\\u0041",
+	"\\u0041",
+	"u0041"
+].
+
+:- pred write_string_as_binary(string::in, io::di, io::uo) is det.
+
+write_string_as_binary(Str, !IO) :-
+	Chars = string.to_char_list(Str),
+	Ints = list.map(to_int, Chars),
+	Bins = list.map(( func(X) = int_to_base_string(X, 2) ), Ints),
+	PaddedBins = list.map(( func(S) = pad_left(S, '0', 8) ), Bins),
+	Bin = string.join_list(" ", PaddedBins),
+	io.write_string(Bin, !IO),
+	io.nl(!IO).
Index: tests/invalid/Mmakefile
===================================================================
RCS file: /home/mercury1/repository/tests/invalid/Mmakefile,v
retrieving revision 1.195
diff -u -r1.195 Mmakefile
--- tests/invalid/Mmakefile	12 Jul 2006 02:51:22 -0000	1.195
+++ tests/invalid/Mmakefile	20 Jul 2006 07:58:43 -0000
@@ -199,6 +199,8 @@
 	undef_symbol \
 	undef_type \
 	undef_type_mod_qual \
+	unicode1 \
+	unicode2 \
 	unify_mode_error \
 	uniq_modes \
 	uniq_neg \
Index: tests/invalid/unicode1.exp
===================================================================
RCS file: tests/invalid/unicode1.exp
diff -N tests/invalid/unicode1.exp
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode1.exp	4 Jul 2006 10:07:16 -0000
@@ -0,0 +1,5 @@
+unicode1.m:005: Syntax error: invalid hex character in unicode escape.
+unicode1.m:005: Syntax error: unterminated quote.
+unicode1.m:001: Warning: interface for module `unicode1' does not export
+unicode1.m:001:   anything.
+For more information, recompile with `-E'.
Index: tests/invalid/unicode1.m
===================================================================
RCS file: tests/invalid/unicode1.m
diff -N tests/invalid/unicode1.m
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode1.m	4 Jul 2006 10:07:06 -0000
@@ -0,0 +1,5 @@
+:- module unicode1.
+
+:- interface.
+
+:- type '\uabxy' == int.
Index: tests/invalid/unicode2.exp
===================================================================
RCS file: tests/invalid/unicode2.exp
diff -N tests/invalid/unicode2.exp
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode2.exp	4 Jul 2006 12:30:52 -0000
@@ -0,0 +1,5 @@
+unicode2.m:005: Syntax error: invalid unicode character code.
+unicode2.m:005: Syntax error: unterminated quote.
+unicode2.m:001: Warning: interface for module `unicode2' does not export
+unicode2.m:001:   anything.
+For more information, recompile with `-E'.
Index: tests/invalid/unicode2.m
===================================================================
RCS file: tests/invalid/unicode2.m
diff -N tests/invalid/unicode2.m
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode2.m	4 Jul 2006 12:27:46 -0000
@@ -0,0 +1,5 @@
+:- module unicode2.
+
+:- interface.
+
+:- type '\UFFFFFFFF' == int.

--------------------------------------------------------------------------
mercury-reviews mailing list
post:  mercury-reviews at csse.unimelb.edu.au
administrative address: owner-mercury-reviews at csse.unimelb.edu.au
unsubscribe: Address: mercury-reviews-request at csse.unimelb.edu.au Message: unsubscribe
subscribe:   Address: mercury-reviews-request at csse.unimelb.edu.au Message: subscribe
--------------------------------------------------------------------------



More information about the reviews mailing list