[m-rev.] for review: add some unicode support to Mercury

Ian MacLarty maclarty at cs.mu.OZ.AU
Wed Jul 5 00:24:25 AEST 2006


For review by anyone.

Estimated hours taken: 3
Branches: main

Add escape sequences for encoding unicode characters in Mercury string
literals.  The unicode characters are encoded in UTF-8.

Add the function utf8_length to the string module to determine the number
of unicode characters in a string.

NEWS:
	Mention the changes.

doc/reference_manual.texi:
	Document the new escape sequences.

	Divide the section on string tokens up into multiple paragraphs.

library/lexer.m:
	Convert unicode characters encoded with the new escape sequences to 
	UTF-8.

library/list.m:
	Fix the formatting of some comments.

library/string.m:
	Add the function utf8_length that returns the number of unicode
	characters in a UTF-8 encoded string.

	Point out that string.length does not return the number of unicode
	characters for UTF-8 encoded strings, but instead returns the number
	of bytes used in the encoding.

tests/hard_coded/Mmakefile:
tests/hard_coded/unicode.exp:
tests/hard_coded/unicode.m:
tests/invalid/Mmakefile:
tests/invalid/unicode1.exp:
tests/invalid/unicode1.m:
tests/invalid/unicode2.exp:
tests/invalid/unicode2.m:
	Test the new support for unicode.

Index: NEWS
===================================================================
RCS file: /home/mercury1/repository/mercury/NEWS,v
retrieving revision 1.414
diff -u -r1.414 NEWS
--- NEWS	13 Jun 2006 09:48:58 -0000	1.414
+++ NEWS	4 Jul 2006 08:21:17 -0000
@@ -32,6 +32,8 @@
 * ':' is now the type qualification operator, not a module qualifier.
 * To ensure soundness, goals in negated contexts using non-local variables
   with dynamic modes (inst "any") must now be marked as impure.
+* Unicode characters can now be encoded in string literals using an
+  escape sequence.
 
 Changes to the Mercury standard library:
 * We have removed the predicates dealing with runtime type information (RTTI)
@@ -46,6 +48,10 @@
 * We have made the predicates semidet_succeed/0, semidet_fail/0 and
   cc_multi_equal/2 into builtins.  Formerly these were exported by std_util.m.
 * We have added an `injection' module, for reversible maps that are injective.
+* We have added string.utf8_length/1 for determining the number of unicode
+  characters in a UTF-8 encoded string.
+* We have added list.count/2 for counting the number of elements in a list
+  for which a given predicate are true.
 
 Changes to the Mercury compiler:
 * The compiler now generates error messages for mismatches between format
@@ -290,6 +296,12 @@
   sure that the goal is in fact pure, e.g. because they know that
   the goal inside the negation will not instantiate the variable.
 
+* Unicode characters can now be encoded in string literals using an
+  escape sequence.  The escape sequence \uXXXX, where XXXX is a unicode
+  character code in hexidecimal, is replaced with the utf-8 encoding of the
+  unicode character.  Similarly the sequence \UXXXXXXXXX can be used to
+  encode unicode characters whose codes are greater than FFFF (in hex).
+
 Changes to the Mercury standard library:
 
 * We have added the function `divide_equivalence_classes' to the `eqvclass'
@@ -309,6 +321,12 @@
 
 * We have added impure_true/0 and semipure_true/0.
 
+* We have added string.utf8_length/1 for determining the number of unicode
+  characters in a UTF-8 encoded string.
+
+* We have added list.count/2 for counting the number of elements in a list
+  for which a given predicate are true.
+
 Changes to the Mercury compiler:
 
 * The compiler now generates error messages for known mismatches between format
Index: doc/reference_manual.texi
===================================================================
RCS file: /home/mercury1/repository/mercury/doc/reference_manual.texi,v
retrieving revision 1.356
diff -u -r1.356 reference_manual.texi
--- doc/reference_manual.texi	23 Jun 2006 07:53:09 -0000	1.356
+++ doc/reference_manual.texi	4 Jul 2006 07:34:24 -0000
@@ -220,21 +220,34 @@
 
 @item string
 A string is a sequence of characters enclosed in double quotes (@code{"}).
+
 Within a string, two adjacent double quotes stand for a single double quote.
 For example, the string @samp{ """" } is a string of length one, containing
 a single double quote: the outermost pair of double quotes encloses the
 string, and the innermost pair stand for a single double quote.
+
 Strings may also contain backslash escapes.  @samp{\a} stands for ``alert''
 (a beep character), @samp{\b} for backspace, @samp{\r} for carriage-return,
 @samp{\f} for form-feed, @samp{\t} for tab, @samp{\n} for newline,
 @samp{\v} for vertical-tab.  An escaped backslash, single-quote, or
-double-quote stands for itself.  The sequence @samp{\x} introduces
+double-quote stands for itself.
+
+The sequence @samp{\x} introduces
 a hexadecimal escape; it must be followed by a sequence of hexadecimal
 digits and then a closing backslash.  It is replaced
 with the character whose character code is identified by the hexadecimal
 number.  Similarly, a backslash followed by an octal digit is the
 beginning of an octal escape; as with hexadecimal escapes, the sequence
 of octal digits must be terminated with a closing backslash.
+
+The sequence @samp{\u} or @samp{\U} can be used to escape unicode characters.
+ at samp{\u} must be followed by the unicode character code expressed as four
+hexidecimal digits.
+ at samp{\U} must be followed by the unicode character code expressed as eight
+hexidecimal digits.
+A @samp{\uXXXX} (or @samp{\UXXXXXXXX}) sequence is replace by the utf-8
+encoding of the unicode character given by @samp{XXXX} (or @samp{XXXXXXXX}).
+
 A backslash followed immediately by a newline is deleted; thus an
 escaped newline can be used to continue a string over more than one
 source line.  (String literals may also contain embedded newlines.)
Index: library/lexer.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/lexer.m,v
retrieving revision 1.46
diff -u -r1.46 lexer.m
--- library/lexer.m	19 Apr 2006 05:17:53 -0000	1.46
+++ library/lexer.m	4 Jul 2006 10:45:38 -0000
@@ -895,6 +895,10 @@
             get_quoted_name(QuoteChar, Chars1, Token, !IO)
         ; Char = 'x' ->
             get_hex_escape(QuoteChar, Chars, [], Token, !IO)
+        ; Char = 'u' ->
+            get_unicode_escape(4, QuoteChar, Chars, [], Token, !IO)
+        ; Char = 'U' ->
+            get_unicode_escape(8, QuoteChar, Chars, [], Token, !IO)
         ; char.is_octal_digit(Char) ->
             get_octal_escape(QuoteChar, Chars, [Char], Token, !IO)
         ;
@@ -923,6 +927,12 @@
         ; Char = 'x' ->
             string_get_hex_escape(String, Len, QuoteChar, Chars, [],
                 Posn0, Token, Context, !Posn)
+        ; Char = 'u' ->
+            string_get_unicode_escape(4, String, Len, QuoteChar, Chars,
+                    [], Posn0, Token, Context, !Posn)
+        ; Char = 'U' ->
+            string_get_unicode_escape(8, String, Len, QuoteChar, Chars,
+                    [], Posn0, Token, Context, !Posn)
         ; char.is_octal_digit(Char) ->
             string_get_octal_escape(String, Len, QuoteChar, Chars, [Char],
                 Posn0, Token, Context, !Posn)
@@ -971,6 +981,111 @@
         )
     ).
 
+:- pred get_unicode_escape(int::in, char::in, list(char)::in, list(char)::in,
+    token::out, io::di, io::uo) is det.
+
+get_unicode_escape(NumHexChars, QuoteChar, Chars, HexChars, Token, !IO) :-
+    ( if NumHexChars = list.length(HexChars) then
+        rev_char_list_to_string(HexChars, HexString),
+        ( if
+            string.base_string_to_int(16, HexString, UnicodeCharCode),
+            encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars)
+        then
+            get_quoted_name(QuoteChar, list.reverse(UTF8Chars) ++ Chars,
+                Token, !IO)
+        else
+            Token = error("invalid unicode character code")
+        )
+    else
+        io.read_char(Result, !IO),
+        (
+            Result = error(Error),
+            Token = io_error(Error)
+        ;
+            Result = eof,
+            Token = error("unterminated quote")
+        ;
+            Result = ok(Char),
+            ( if char.is_hex_digit(Char) then
+                get_unicode_escape(NumHexChars, QuoteChar, Chars,
+                    [Char | HexChars], Token, !IO)
+            else
+                Token = error("invalid hex character in unicode escape")
+            )
+        )
+    ).
+
+:- pred string_get_unicode_escape(int::in, string::in, int::in, char::in,
+    list(char)::in, list(char)::in, posn::in, token::out,
+    string_token_context::out, posn::in, posn::out) is det.
+
+string_get_unicode_escape(NumHexChars, String, Len, QuoteChar, Chars,
+        HexChars, Posn0, Token, Context, !Posn) :-
+    ( if NumHexChars = list.length(HexChars) then
+        rev_char_list_to_string(HexChars, HexString),
+        ( if
+            string.base_string_to_int(16, HexString, UnicodeCharCode),
+            encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars)
+        then
+            RevCharsWithUnicode = list.reverse(UTF8Chars) ++ Chars,
+            string_get_quoted_name(String, Len, QuoteChar, RevCharsWithUnicode,
+                Posn0, Token, Context, !Posn)
+        else
+            string_get_context(Posn0, Context, !Posn),
+            Token = error("invalid unicode character code")
+        )
+    else
+        ( if string_read_char(String, Len, Char, !Posn) then
+            ( if char.is_hex_digit(Char) then
+                string_get_unicode_escape(NumHexChars, String, Len, QuoteChar,
+                    Chars, [Char | HexChars], Posn0, Token, Context, !Posn)
+            else
+                string_get_context(Posn0, Context, !Posn),
+                Token = error("invalid hex character in unicode escape")
+            )
+        else
+            string_get_context(Posn0, Context, !Posn),
+            Token = error("unterminated quote")
+        )
+    ).
+
+:- pred encode_unicode_char_as_utf8(int::in, list(char)::out) is semidet.
+
+encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars) :-
+    ( if UnicodeCharCode >= 0 then
+        ( if UnicodeCharCode =< 0x00007F then
+            UTF8Chars = [char.det_from_int(UnicodeCharCode)]
+        else if UnicodeCharCode =< 0x0007FF then
+            Part1 = (0b11111000000 /\ UnicodeCharCode) >> 6,
+            Part2 = 0b00000111111 /\ UnicodeCharCode,
+            char.det_from_int(Part1 \/ 0b11000000, UTF8Char1),
+            char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+            UTF8Chars = [UTF8Char1, UTF8Char2]
+        else if UnicodeCharCode =< 0x00FFFF then
+            Part1 = (0b1111000000000000 /\ UnicodeCharCode) >> 12,
+            Part2 = (0b0000111111000000 /\ UnicodeCharCode) >> 6,
+            Part3 = 0b0000000000111111 /\ UnicodeCharCode,
+            char.det_from_int(Part1 \/ 0b11100000, UTF8Char1),
+            char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+            char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
+            UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3]
+        else if UnicodeCharCode =< 0x10FFFF then
+            Part1 = (0b111000000000000000000 /\ UnicodeCharCode) >> 18,
+            Part2 = (0b000111111000000000000 /\ UnicodeCharCode) >> 12,
+            Part3 = (0b000000000111111000000 /\ UnicodeCharCode) >> 6,
+            Part4 =  0b000000000000000111111 /\ UnicodeCharCode,
+            char.det_from_int(Part1 \/ 0b11110000, UTF8Char1),
+            char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+            char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
+            char.det_from_int(Part4 \/ 0b10000000, UTF8Char4),
+            UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3, UTF8Char4]
+        else
+            fail
+        )
+    else
+        fail
+    ).
+
 :- pred string_get_hex_escape(string::in, int::in, char::in,
     list(char)::in, list(char)::in, posn::in, token::out,
     string_token_context::out, posn::in, posn::out) is det.
Index: library/list.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/list.m,v
retrieving revision 1.150
diff -u -r1.150 list.m
--- library/list.m	15 May 2006 09:12:05 -0000	1.150
+++ library/list.m	4 Jul 2006 13:35:59 -0000
@@ -1021,14 +1021,17 @@
     % Same as list.filter_map/3 except that it only returns the first
     % match:
     %   find_first_map(X, Y, Z) <=> list.filter_map(X, Y, [Z | _])
+    %
 :- pred list.find_first_map(pred(X, Y)::in(pred(in, out) is semidet),
         list(X)::in, Y::out) is semidet.
 
     % Same as list.find_first_map, except with two outputs.
+    %
 :- pred list.find_first_map2(pred(X, A, B)::in(pred(in, out, out) is semidet),
     list(X)::in, A::out, B::out) is semidet.
 
     % Same as list.find_first_map, except with three outputs.
+    %
 :- pred list.find_first_map3(
     pred(X, A, B, C)::in(pred(in, out, out, out) is semidet),
     list(X)::in, A::out, B::out, C::out) is semidet.
Index: library/string.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/string.m,v
retrieving revision 1.245
diff -u -r1.245 string.m
--- library/string.m	19 Apr 2006 05:17:57 -0000	1.245
+++ library/string.m	4 Jul 2006 14:16:30 -0000
@@ -33,12 +33,21 @@
     % Determine the length of a string.
     % An empty string has length zero.
     %
+    % NOTE: for UTF-8 encoded strings, the length function returns the number
+    % of bytes in the UTF-8 representation of the string, not the number of
+    % unicode characters in the string.  Use utf8_length/1 to find the number
+    % of unicode characters in the string.
+    %
 :- func string.length(string) = int.
 :- mode string.length(in) = uo is det.
 :- pred string.length(string, int).
 :- mode string.length(in, uo) is det.
 :- mode string.length(ui, uo) is det.
 
+    % Return the number of unicode characters in a UTF-8 encoded string.
+    %
+:- func string.utf8_length(string) = int.
+
     % Append two strings together.
     %
 :- func string.append(string, string) = string.
@@ -3472,6 +3481,24 @@
         Length = Index
     ).
 
+string.utf8_length(Str) = Length :-
+    string.foldl(add_one_if_first_byte_in_utf8_char, Str, 0, Length).
+
+:- pred add_one_if_first_byte_in_utf8_char(char::in, int::in, int::out)
+    is det.
+
+add_one_if_first_byte_in_utf8_char(Char, !Acc) :-
+    Int = char.to_int(Char),
+    ( if
+        ( Int =< 0x7F
+        ; (Int /\ 0b11000000) = 0b11000000
+        )
+    then
+        !:Acc = !.Acc + 1
+    else
+        true
+    ).
+
 /*-----------------------------------------------------------------------*/
 
 :- pragma promise_pure(string.append/3).
Index: tests/hard_coded/Mmakefile
===================================================================
RCS file: /home/mercury1/repository/tests/hard_coded/Mmakefile,v
retrieving revision 1.288
diff -u -r1.288 Mmakefile
--- tests/hard_coded/Mmakefile	30 Jun 2006 12:51:55 -0000	1.288
+++ tests/hard_coded/Mmakefile	4 Jul 2006 10:04:30 -0000
@@ -213,6 +213,7 @@
 	type_spec_ho_term \
 	type_spec_modes \
 	type_to_term_bug \
+	unicode \
 	unify_existq_cons \
 	unify_expression \
 	unify_typeinfo_bug \
Index: tests/hard_coded/unicode.exp
===================================================================
RCS file: tests/hard_coded/unicode.exp
diff -N tests/hard_coded/unicode.exp
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode.exp	4 Jul 2006 10:01:52 -0000
@@ -0,0 +1,11 @@
+00000011
+00000011
+11001110 10010100
+11001110 10100000
+11101111 10111111 10111111
+11110100 10001111 10111111 10111111
+11110010 10101011 10110011 10011110
+01110010 11000011 10101001 01110011 01110101 01101101 11000011 10101001
+01100001 01100010 01100011 00110001 00110010 00110011
+
+[1, 1, 1, 1, 1, 1, 1, 6, 6]
Index: tests/hard_coded/unicode.m
===================================================================
RCS file: tests/hard_coded/unicode.m
diff -N tests/hard_coded/unicode.m
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode.m	4 Jul 2006 10:01:15 -0000
@@ -0,0 +1,45 @@
+:- module unicode.
+
+:- interface.
+
+:- import_module io.
+
+:- pred main(io::di, io::uo) is det.
+
+:- implementation.
+
+:- import_module char.
+:- import_module list.
+:- import_module string.
+
+main(!IO) :-
+	list.foldl(write_string_as_binary, utf8_strings, !IO),
+	io.nl(!IO),
+	Lengths = list.map(string.utf8_length, utf8_strings),
+	io.write(Lengths, !IO),
+	io.nl(!IO).
+
+:- func utf8_strings = list(string).
+
+utf8_strings = [
+	"\u0003",
+	"\U00000003",
+	"\u0394",  % delta
+	"\u03A0",  % pi
+	"\uFFFF",
+	"\U0010FFFF",
+	"\U000ABCDE",
+	"r\u00E9sum\u00E9", % "resume" with accents
+	"abc123"
+].
+
+:- pred write_string_as_binary(string::in, io::di, io::uo) is det.
+
+write_string_as_binary(Str, !IO) :-
+	Chars = string.to_char_list(Str),
+	Ints = list.map(to_int, Chars),
+	Bins = list.map(( func(X) = int_to_base_string(X, 2) ), Ints),
+	PaddedBins = list.map(( func(S) = pad_left(S, '0', 8) ), Bins),
+	Bin = string.join_list(" ", PaddedBins),
+	io.write_string(Bin, !IO),
+	io.nl(!IO).
Index: tests/invalid/Mmakefile
===================================================================
RCS file: /home/mercury1/repository/tests/invalid/Mmakefile,v
retrieving revision 1.193
diff -u -r1.193 Mmakefile
--- tests/invalid/Mmakefile	16 Jun 2006 07:24:41 -0000	1.193
+++ tests/invalid/Mmakefile	4 Jul 2006 12:53:42 -0000
@@ -197,6 +197,8 @@
 	undef_symbol \
 	undef_type \
 	undef_type_mod_qual \
+	unicode1 \
+	unicode2 \
 	unify_mode_error \
 	uniq_modes \
 	uniq_neg \
Index: tests/invalid/unicode1.exp
===================================================================
RCS file: tests/invalid/unicode1.exp
diff -N tests/invalid/unicode1.exp
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode1.exp	4 Jul 2006 10:07:16 -0000
@@ -0,0 +1,5 @@
+unicode1.m:005: Syntax error: invalid hex character in unicode escape.
+unicode1.m:005: Syntax error: unterminated quote.
+unicode1.m:001: Warning: interface for module `unicode1' does not export
+unicode1.m:001:   anything.
+For more information, recompile with `-E'.
Index: tests/invalid/unicode1.m
===================================================================
RCS file: tests/invalid/unicode1.m
diff -N tests/invalid/unicode1.m
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode1.m	4 Jul 2006 10:07:06 -0000
@@ -0,0 +1,5 @@
+:- module unicode1.
+
+:- interface.
+
+:- type '\uabxy' == int.
Index: tests/invalid/unicode2.exp
===================================================================
RCS file: tests/invalid/unicode2.exp
diff -N tests/invalid/unicode2.exp
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode2.exp	4 Jul 2006 12:30:52 -0000
@@ -0,0 +1,5 @@
+unicode2.m:005: Syntax error: invalid unicode character code.
+unicode2.m:005: Syntax error: unterminated quote.
+unicode2.m:001: Warning: interface for module `unicode2' does not export
+unicode2.m:001:   anything.
+For more information, recompile with `-E'.
Index: tests/invalid/unicode2.m
===================================================================
RCS file: tests/invalid/unicode2.m
diff -N tests/invalid/unicode2.m
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode2.m	4 Jul 2006 12:27:46 -0000
@@ -0,0 +1,5 @@
+:- module unicode2.
+
+:- interface.
+
+:- type '\UFFFFFFFF' == int.
--------------------------------------------------------------------------
mercury-reviews mailing list
post:  mercury-reviews at cs.mu.oz.au
administrative address: owner-mercury-reviews at cs.mu.oz.au
unsubscribe: Address: mercury-reviews-request at cs.mu.oz.au Message: unsubscribe
subscribe:   Address: mercury-reviews-request at cs.mu.oz.au Message: subscribe
--------------------------------------------------------------------------



More information about the reviews mailing list