[m-rev.] for review: add some unicode support to Mercury
Ian MacLarty
maclarty at cs.mu.OZ.AU
Wed Jul 5 00:24:25 AEST 2006
For review by anyone.
Estimated hours taken: 3
Branches: main
Add escape sequences for encoding unicode characters in Mercury string
literals. The unicode characters are encoded in UTF-8.
Add the function utf8_length to the string module to determine the number
of unicode characters in a string.
NEWS:
Mention the changes.
doc/reference_manual.texi:
Document the new escape sequences.
Divide the section on string tokens up into multiple paragraphs.
library/lexer.m:
Convert unicode characters encoded with the new escape sequences to
UTF-8.
library/list.m:
Fix the formatting of some comments.
library/string.m:
Add the function utf8_length that returns the number of unicode
characters in a UTF-8 encoded string.
Point out that string.length does not return the number of unicode
characters for UTF-8 encoded strings, but instead returns the number
of bytes used in the encoding.
tests/hard_coded/Mmakefile:
tests/hard_coded/unicode.exp:
tests/hard_coded/unicode.m:
tests/invalid/Mmakefile:
tests/invalid/unicode1.exp:
tests/invalid/unicode1.m:
tests/invalid/unicode2.exp:
tests/invalid/unicode2.m:
Test the new support for unicode.
Index: NEWS
===================================================================
RCS file: /home/mercury1/repository/mercury/NEWS,v
retrieving revision 1.414
diff -u -r1.414 NEWS
--- NEWS 13 Jun 2006 09:48:58 -0000 1.414
+++ NEWS 4 Jul 2006 08:21:17 -0000
@@ -32,6 +32,8 @@
* ':' is now the type qualification operator, not a module qualifier.
* To ensure soundness, goals in negated contexts using non-local variables
with dynamic modes (inst "any") must now be marked as impure.
+* Unicode characters can now be encoded in string literals using an
+ escape sequence.
Changes to the Mercury standard library:
* We have removed the predicates dealing with runtime type information (RTTI)
@@ -46,6 +48,10 @@
* We have made the predicates semidet_succeed/0, semidet_fail/0 and
cc_multi_equal/2 into builtins. Formerly these were exported by std_util.m.
* We have added an `injection' module, for reversible maps that are injective.
+* We have added string.utf8_length/1 for determining the number of unicode
+ characters in a UTF-8 encoded string.
+* We have added list.count/2 for counting the number of elements in a list
+ for which a given predicate are true.
Changes to the Mercury compiler:
* The compiler now generates error messages for mismatches between format
@@ -290,6 +296,12 @@
sure that the goal is in fact pure, e.g. because they know that
the goal inside the negation will not instantiate the variable.
+* Unicode characters can now be encoded in string literals using an
+ escape sequence. The escape sequence \uXXXX, where XXXX is a unicode
+ character code in hexidecimal, is replaced with the utf-8 encoding of the
+ unicode character. Similarly the sequence \UXXXXXXXXX can be used to
+ encode unicode characters whose codes are greater than FFFF (in hex).
+
Changes to the Mercury standard library:
* We have added the function `divide_equivalence_classes' to the `eqvclass'
@@ -309,6 +321,12 @@
* We have added impure_true/0 and semipure_true/0.
+* We have added string.utf8_length/1 for determining the number of unicode
+ characters in a UTF-8 encoded string.
+
+* We have added list.count/2 for counting the number of elements in a list
+ for which a given predicate are true.
+
Changes to the Mercury compiler:
* The compiler now generates error messages for known mismatches between format
Index: doc/reference_manual.texi
===================================================================
RCS file: /home/mercury1/repository/mercury/doc/reference_manual.texi,v
retrieving revision 1.356
diff -u -r1.356 reference_manual.texi
--- doc/reference_manual.texi 23 Jun 2006 07:53:09 -0000 1.356
+++ doc/reference_manual.texi 4 Jul 2006 07:34:24 -0000
@@ -220,21 +220,34 @@
@item string
A string is a sequence of characters enclosed in double quotes (@code{"}).
+
Within a string, two adjacent double quotes stand for a single double quote.
For example, the string @samp{ """" } is a string of length one, containing
a single double quote: the outermost pair of double quotes encloses the
string, and the innermost pair stand for a single double quote.
+
Strings may also contain backslash escapes. @samp{\a} stands for ``alert''
(a beep character), @samp{\b} for backspace, @samp{\r} for carriage-return,
@samp{\f} for form-feed, @samp{\t} for tab, @samp{\n} for newline,
@samp{\v} for vertical-tab. An escaped backslash, single-quote, or
-double-quote stands for itself. The sequence @samp{\x} introduces
+double-quote stands for itself.
+
+The sequence @samp{\x} introduces
a hexadecimal escape; it must be followed by a sequence of hexadecimal
digits and then a closing backslash. It is replaced
with the character whose character code is identified by the hexadecimal
number. Similarly, a backslash followed by an octal digit is the
beginning of an octal escape; as with hexadecimal escapes, the sequence
of octal digits must be terminated with a closing backslash.
+
+The sequence @samp{\u} or @samp{\U} can be used to escape unicode characters.
+ at samp{\u} must be followed by the unicode character code expressed as four
+hexidecimal digits.
+ at samp{\U} must be followed by the unicode character code expressed as eight
+hexidecimal digits.
+A @samp{\uXXXX} (or @samp{\UXXXXXXXX}) sequence is replace by the utf-8
+encoding of the unicode character given by @samp{XXXX} (or @samp{XXXXXXXX}).
+
A backslash followed immediately by a newline is deleted; thus an
escaped newline can be used to continue a string over more than one
source line. (String literals may also contain embedded newlines.)
Index: library/lexer.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/lexer.m,v
retrieving revision 1.46
diff -u -r1.46 lexer.m
--- library/lexer.m 19 Apr 2006 05:17:53 -0000 1.46
+++ library/lexer.m 4 Jul 2006 10:45:38 -0000
@@ -895,6 +895,10 @@
get_quoted_name(QuoteChar, Chars1, Token, !IO)
; Char = 'x' ->
get_hex_escape(QuoteChar, Chars, [], Token, !IO)
+ ; Char = 'u' ->
+ get_unicode_escape(4, QuoteChar, Chars, [], Token, !IO)
+ ; Char = 'U' ->
+ get_unicode_escape(8, QuoteChar, Chars, [], Token, !IO)
; char.is_octal_digit(Char) ->
get_octal_escape(QuoteChar, Chars, [Char], Token, !IO)
;
@@ -923,6 +927,12 @@
; Char = 'x' ->
string_get_hex_escape(String, Len, QuoteChar, Chars, [],
Posn0, Token, Context, !Posn)
+ ; Char = 'u' ->
+ string_get_unicode_escape(4, String, Len, QuoteChar, Chars,
+ [], Posn0, Token, Context, !Posn)
+ ; Char = 'U' ->
+ string_get_unicode_escape(8, String, Len, QuoteChar, Chars,
+ [], Posn0, Token, Context, !Posn)
; char.is_octal_digit(Char) ->
string_get_octal_escape(String, Len, QuoteChar, Chars, [Char],
Posn0, Token, Context, !Posn)
@@ -971,6 +981,111 @@
)
).
+:- pred get_unicode_escape(int::in, char::in, list(char)::in, list(char)::in,
+ token::out, io::di, io::uo) is det.
+
+get_unicode_escape(NumHexChars, QuoteChar, Chars, HexChars, Token, !IO) :-
+ ( if NumHexChars = list.length(HexChars) then
+ rev_char_list_to_string(HexChars, HexString),
+ ( if
+ string.base_string_to_int(16, HexString, UnicodeCharCode),
+ encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars)
+ then
+ get_quoted_name(QuoteChar, list.reverse(UTF8Chars) ++ Chars,
+ Token, !IO)
+ else
+ Token = error("invalid unicode character code")
+ )
+ else
+ io.read_char(Result, !IO),
+ (
+ Result = error(Error),
+ Token = io_error(Error)
+ ;
+ Result = eof,
+ Token = error("unterminated quote")
+ ;
+ Result = ok(Char),
+ ( if char.is_hex_digit(Char) then
+ get_unicode_escape(NumHexChars, QuoteChar, Chars,
+ [Char | HexChars], Token, !IO)
+ else
+ Token = error("invalid hex character in unicode escape")
+ )
+ )
+ ).
+
+:- pred string_get_unicode_escape(int::in, string::in, int::in, char::in,
+ list(char)::in, list(char)::in, posn::in, token::out,
+ string_token_context::out, posn::in, posn::out) is det.
+
+string_get_unicode_escape(NumHexChars, String, Len, QuoteChar, Chars,
+ HexChars, Posn0, Token, Context, !Posn) :-
+ ( if NumHexChars = list.length(HexChars) then
+ rev_char_list_to_string(HexChars, HexString),
+ ( if
+ string.base_string_to_int(16, HexString, UnicodeCharCode),
+ encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars)
+ then
+ RevCharsWithUnicode = list.reverse(UTF8Chars) ++ Chars,
+ string_get_quoted_name(String, Len, QuoteChar, RevCharsWithUnicode,
+ Posn0, Token, Context, !Posn)
+ else
+ string_get_context(Posn0, Context, !Posn),
+ Token = error("invalid unicode character code")
+ )
+ else
+ ( if string_read_char(String, Len, Char, !Posn) then
+ ( if char.is_hex_digit(Char) then
+ string_get_unicode_escape(NumHexChars, String, Len, QuoteChar,
+ Chars, [Char | HexChars], Posn0, Token, Context, !Posn)
+ else
+ string_get_context(Posn0, Context, !Posn),
+ Token = error("invalid hex character in unicode escape")
+ )
+ else
+ string_get_context(Posn0, Context, !Posn),
+ Token = error("unterminated quote")
+ )
+ ).
+
+:- pred encode_unicode_char_as_utf8(int::in, list(char)::out) is semidet.
+
+encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars) :-
+ ( if UnicodeCharCode >= 0 then
+ ( if UnicodeCharCode =< 0x00007F then
+ UTF8Chars = [char.det_from_int(UnicodeCharCode)]
+ else if UnicodeCharCode =< 0x0007FF then
+ Part1 = (0b11111000000 /\ UnicodeCharCode) >> 6,
+ Part2 = 0b00000111111 /\ UnicodeCharCode,
+ char.det_from_int(Part1 \/ 0b11000000, UTF8Char1),
+ char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+ UTF8Chars = [UTF8Char1, UTF8Char2]
+ else if UnicodeCharCode =< 0x00FFFF then
+ Part1 = (0b1111000000000000 /\ UnicodeCharCode) >> 12,
+ Part2 = (0b0000111111000000 /\ UnicodeCharCode) >> 6,
+ Part3 = 0b0000000000111111 /\ UnicodeCharCode,
+ char.det_from_int(Part1 \/ 0b11100000, UTF8Char1),
+ char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+ char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
+ UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3]
+ else if UnicodeCharCode =< 0x10FFFF then
+ Part1 = (0b111000000000000000000 /\ UnicodeCharCode) >> 18,
+ Part2 = (0b000111111000000000000 /\ UnicodeCharCode) >> 12,
+ Part3 = (0b000000000111111000000 /\ UnicodeCharCode) >> 6,
+ Part4 = 0b000000000000000111111 /\ UnicodeCharCode,
+ char.det_from_int(Part1 \/ 0b11110000, UTF8Char1),
+ char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+ char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
+ char.det_from_int(Part4 \/ 0b10000000, UTF8Char4),
+ UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3, UTF8Char4]
+ else
+ fail
+ )
+ else
+ fail
+ ).
+
:- pred string_get_hex_escape(string::in, int::in, char::in,
list(char)::in, list(char)::in, posn::in, token::out,
string_token_context::out, posn::in, posn::out) is det.
Index: library/list.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/list.m,v
retrieving revision 1.150
diff -u -r1.150 list.m
--- library/list.m 15 May 2006 09:12:05 -0000 1.150
+++ library/list.m 4 Jul 2006 13:35:59 -0000
@@ -1021,14 +1021,17 @@
% Same as list.filter_map/3 except that it only returns the first
% match:
% find_first_map(X, Y, Z) <=> list.filter_map(X, Y, [Z | _])
+ %
:- pred list.find_first_map(pred(X, Y)::in(pred(in, out) is semidet),
list(X)::in, Y::out) is semidet.
% Same as list.find_first_map, except with two outputs.
+ %
:- pred list.find_first_map2(pred(X, A, B)::in(pred(in, out, out) is semidet),
list(X)::in, A::out, B::out) is semidet.
% Same as list.find_first_map, except with three outputs.
+ %
:- pred list.find_first_map3(
pred(X, A, B, C)::in(pred(in, out, out, out) is semidet),
list(X)::in, A::out, B::out, C::out) is semidet.
Index: library/string.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/string.m,v
retrieving revision 1.245
diff -u -r1.245 string.m
--- library/string.m 19 Apr 2006 05:17:57 -0000 1.245
+++ library/string.m 4 Jul 2006 14:16:30 -0000
@@ -33,12 +33,21 @@
% Determine the length of a string.
% An empty string has length zero.
%
+ % NOTE: for UTF-8 encoded strings, the length function returns the number
+ % of bytes in the UTF-8 representation of the string, not the number of
+ % unicode characters in the string. Use utf8_length/1 to find the number
+ % of unicode characters in the string.
+ %
:- func string.length(string) = int.
:- mode string.length(in) = uo is det.
:- pred string.length(string, int).
:- mode string.length(in, uo) is det.
:- mode string.length(ui, uo) is det.
+ % Return the number of unicode characters in a UTF-8 encoded string.
+ %
+:- func string.utf8_length(string) = int.
+
% Append two strings together.
%
:- func string.append(string, string) = string.
@@ -3472,6 +3481,24 @@
Length = Index
).
+string.utf8_length(Str) = Length :-
+ string.foldl(add_one_if_first_byte_in_utf8_char, Str, 0, Length).
+
+:- pred add_one_if_first_byte_in_utf8_char(char::in, int::in, int::out)
+ is det.
+
+add_one_if_first_byte_in_utf8_char(Char, !Acc) :-
+ Int = char.to_int(Char),
+ ( if
+ ( Int =< 0x7F
+ ; (Int /\ 0b11000000) = 0b11000000
+ )
+ then
+ !:Acc = !.Acc + 1
+ else
+ true
+ ).
+
/*-----------------------------------------------------------------------*/
:- pragma promise_pure(string.append/3).
Index: tests/hard_coded/Mmakefile
===================================================================
RCS file: /home/mercury1/repository/tests/hard_coded/Mmakefile,v
retrieving revision 1.288
diff -u -r1.288 Mmakefile
--- tests/hard_coded/Mmakefile 30 Jun 2006 12:51:55 -0000 1.288
+++ tests/hard_coded/Mmakefile 4 Jul 2006 10:04:30 -0000
@@ -213,6 +213,7 @@
type_spec_ho_term \
type_spec_modes \
type_to_term_bug \
+ unicode \
unify_existq_cons \
unify_expression \
unify_typeinfo_bug \
Index: tests/hard_coded/unicode.exp
===================================================================
RCS file: tests/hard_coded/unicode.exp
diff -N tests/hard_coded/unicode.exp
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode.exp 4 Jul 2006 10:01:52 -0000
@@ -0,0 +1,11 @@
+00000011
+00000011
+11001110 10010100
+11001110 10100000
+11101111 10111111 10111111
+11110100 10001111 10111111 10111111
+11110010 10101011 10110011 10011110
+01110010 11000011 10101001 01110011 01110101 01101101 11000011 10101001
+01100001 01100010 01100011 00110001 00110010 00110011
+
+[1, 1, 1, 1, 1, 1, 1, 6, 6]
Index: tests/hard_coded/unicode.m
===================================================================
RCS file: tests/hard_coded/unicode.m
diff -N tests/hard_coded/unicode.m
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode.m 4 Jul 2006 10:01:15 -0000
@@ -0,0 +1,45 @@
+:- module unicode.
+
+:- interface.
+
+:- import_module io.
+
+:- pred main(io::di, io::uo) is det.
+
+:- implementation.
+
+:- import_module char.
+:- import_module list.
+:- import_module string.
+
+main(!IO) :-
+ list.foldl(write_string_as_binary, utf8_strings, !IO),
+ io.nl(!IO),
+ Lengths = list.map(string.utf8_length, utf8_strings),
+ io.write(Lengths, !IO),
+ io.nl(!IO).
+
+:- func utf8_strings = list(string).
+
+utf8_strings = [
+ "\u0003",
+ "\U00000003",
+ "\u0394", % delta
+ "\u03A0", % pi
+ "\uFFFF",
+ "\U0010FFFF",
+ "\U000ABCDE",
+ "r\u00E9sum\u00E9", % "resume" with accents
+ "abc123"
+].
+
+:- pred write_string_as_binary(string::in, io::di, io::uo) is det.
+
+write_string_as_binary(Str, !IO) :-
+ Chars = string.to_char_list(Str),
+ Ints = list.map(to_int, Chars),
+ Bins = list.map(( func(X) = int_to_base_string(X, 2) ), Ints),
+ PaddedBins = list.map(( func(S) = pad_left(S, '0', 8) ), Bins),
+ Bin = string.join_list(" ", PaddedBins),
+ io.write_string(Bin, !IO),
+ io.nl(!IO).
Index: tests/invalid/Mmakefile
===================================================================
RCS file: /home/mercury1/repository/tests/invalid/Mmakefile,v
retrieving revision 1.193
diff -u -r1.193 Mmakefile
--- tests/invalid/Mmakefile 16 Jun 2006 07:24:41 -0000 1.193
+++ tests/invalid/Mmakefile 4 Jul 2006 12:53:42 -0000
@@ -197,6 +197,8 @@
undef_symbol \
undef_type \
undef_type_mod_qual \
+ unicode1 \
+ unicode2 \
unify_mode_error \
uniq_modes \
uniq_neg \
Index: tests/invalid/unicode1.exp
===================================================================
RCS file: tests/invalid/unicode1.exp
diff -N tests/invalid/unicode1.exp
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode1.exp 4 Jul 2006 10:07:16 -0000
@@ -0,0 +1,5 @@
+unicode1.m:005: Syntax error: invalid hex character in unicode escape.
+unicode1.m:005: Syntax error: unterminated quote.
+unicode1.m:001: Warning: interface for module `unicode1' does not export
+unicode1.m:001: anything.
+For more information, recompile with `-E'.
Index: tests/invalid/unicode1.m
===================================================================
RCS file: tests/invalid/unicode1.m
diff -N tests/invalid/unicode1.m
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode1.m 4 Jul 2006 10:07:06 -0000
@@ -0,0 +1,5 @@
+:- module unicode1.
+
+:- interface.
+
+:- type '\uabxy' == int.
Index: tests/invalid/unicode2.exp
===================================================================
RCS file: tests/invalid/unicode2.exp
diff -N tests/invalid/unicode2.exp
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode2.exp 4 Jul 2006 12:30:52 -0000
@@ -0,0 +1,5 @@
+unicode2.m:005: Syntax error: invalid unicode character code.
+unicode2.m:005: Syntax error: unterminated quote.
+unicode2.m:001: Warning: interface for module `unicode2' does not export
+unicode2.m:001: anything.
+For more information, recompile with `-E'.
Index: tests/invalid/unicode2.m
===================================================================
RCS file: tests/invalid/unicode2.m
diff -N tests/invalid/unicode2.m
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode2.m 4 Jul 2006 12:27:46 -0000
@@ -0,0 +1,5 @@
+:- module unicode2.
+
+:- interface.
+
+:- type '\UFFFFFFFF' == int.
--------------------------------------------------------------------------
mercury-reviews mailing list
post: mercury-reviews at cs.mu.oz.au
administrative address: owner-mercury-reviews at cs.mu.oz.au
unsubscribe: Address: mercury-reviews-request at cs.mu.oz.au Message: unsubscribe
subscribe: Address: mercury-reviews-request at cs.mu.oz.au Message: subscribe
--------------------------------------------------------------------------
More information about the reviews
mailing list