[m-rev.] for review: add some unicode support to Mercury
Ian MacLarty
maclarty at csse.unimelb.edu.au
Fri Jul 21 11:01:50 AEST 2006
On Wed, Jul 19, 2006 at 04:01:57PM +1000, Peter Moulder wrote:
> On Wed, Jul 05, 2006 at 12:24:25AM +1000, Ian MacLarty wrote:
>
> > +The sequence @samp{\x} introduces
> > a hexadecimal escape; it must be followed by a sequence of hexadecimal
> > digits and then a closing backslash. It is replaced
> > with the character whose character code is identified by the hexadecimal
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> I suggest changing to `byte whose value', to clarify that e.g. \xa0\ is
> replaced by just one byte rather than being equivalent to \u00a0.
>
> > + % Return the number of unicode characters in a UTF-8 encoded string.
>
> I suggest explicitly stating that the result is undefined (or
> unspecified or similar) if the given string isn't valid utf-8. (The
> existing documentation gives the false impression that it counts only
> valid, complete unicode characters.)
>
> > +++ tests/hard_coded/unicode.m 4 Jul 2006 10:01:15 -0000
> ...
> > +utf8_strings = [
> > + "\u0003",
> > + "\U00000003",
> > + "\u0394", % delta
> > + "\u03A0", % pi
> > + "\uFFFF",
> > + "\U0010FFFF",
> > + "\U000ABCDE",
> > + "r\u00E9sum\u00E9", % "resume" with accents
> > + "abc123"
> > +].
>
> It would be nice to add "\u005cu0041" as an example (0x5c = backslash),
> and similarly "\x5c\u0041", "\x5c\\u0041", "\\u0041" and "u0041".
> It would be good for some of these examples to use lowercase hex digits.
>
> Otherwise looks fine to me.
>
Here's the new diff and CVS log (the interdiff is almost as big as the
diff, so I'm just posting the diff).
I'll post the new unicode module as a separate change.
Estimated hours taken: 6
Branches: main
Add escape sequences for encoding unicode characters in Mercury string
literals. The unicode characters are encoded in UTF-8 on the C backends and
in UTF-16 on the Java and IL backends.
NEWS:
Mention the changes.
doc/reference_manual.texi:
Document the new escape sequences.
Divide the section on string tokens up into multiple paragraphs.
library/lexer.m:
Convert unicode characters encoded with the new escape sequences to
UTF-8 on the C backends and UTF-16 on the Java and IL backends.
Some of the new predicate may be moved to a "unicode" module that I'm
intending to add to the standard library in a subsequent change.
library/list.m:
Fix the formatting of some comments.
tests/hard_coded/Mmakefile:
tests/hard_coded/unicode.exp:
tests/hard_coded/unicode.m:
tests/invalid/Mmakefile:
tests/invalid/unicode1.exp:
tests/invalid/unicode1.m:
tests/invalid/unicode2.exp:
tests/invalid/unicode2.m:
Test the new support for unicode.
Index: NEWS
===================================================================
RCS file: /home/mercury1/repository/mercury/NEWS,v
retrieving revision 1.415
diff -u -r1.415 NEWS
--- NEWS 20 Jul 2006 01:50:41 -0000 1.415
+++ NEWS 20 Jul 2006 08:21:32 -0000
@@ -32,6 +32,8 @@
* ':' is now the type qualification operator, not a module qualifier.
* To ensure soundness, goals in negated contexts using non-local variables
with dynamic modes (inst "any") must now be marked as impure.
+* Unicode characters can now be encoded in string literals using an
+ escape sequence.
Changes to the Mercury standard library:
* We have removed the predicates dealing with runtime type information (RTTI)
@@ -292,6 +294,13 @@
sure that the goal is in fact pure, e.g. because they know that
the goal inside the negation will not instantiate the variable.
+* Unicode characters can now be encoded in string literals using an
+ escape sequence. The escape sequence \uXXXX (or \UXXXXXXXX), where XXXX
+ (or XXXXXXXX) is a unicode character code in hexidecimal, is replaced with
+ the corresponding unicode character. The encoding used to represent the
+ unicode character is implementation dependent. For the Melbourne Mercury
+ compiler unicode characters are represented using UTF-8 for the C backends.
+
Changes to the Mercury standard library:
* We have added the function `divide_equivalence_classes' to the `eqvclass'
Index: doc/reference_manual.texi
===================================================================
RCS file: /home/mercury1/repository/mercury/doc/reference_manual.texi,v
retrieving revision 1.361
diff -u -r1.361 reference_manual.texi
--- doc/reference_manual.texi 20 Jul 2006 02:44:04 -0000 1.361
+++ doc/reference_manual.texi 20 Jul 2006 07:58:33 -0000
@@ -220,21 +220,35 @@
@item string
A string is a sequence of characters enclosed in double quotes (@code{"}).
+
Within a string, two adjacent double quotes stand for a single double quote.
For example, the string @samp{ """" } is a string of length one, containing
a single double quote: the outermost pair of double quotes encloses the
string, and the innermost pair stand for a single double quote.
+
Strings may also contain backslash escapes. @samp{\a} stands for ``alert''
(a beep character), @samp{\b} for backspace, @samp{\r} for carriage-return,
@samp{\f} for form-feed, @samp{\t} for tab, @samp{\n} for newline,
@samp{\v} for vertical-tab. An escaped backslash, single-quote, or
-double-quote stands for itself. The sequence @samp{\x} introduces
+double-quote stands for itself.
+
+The sequence @samp{\x} introduces
a hexadecimal escape; it must be followed by a sequence of hexadecimal
digits and then a closing backslash. It is replaced
with the character whose character code is identified by the hexadecimal
number. Similarly, a backslash followed by an octal digit is the
beginning of an octal escape; as with hexadecimal escapes, the sequence
of octal digits must be terminated with a closing backslash.
+
+The sequence @samp{\u} or @samp{\U} can be used to escape unicode characters.
+ at samp{\u} must be followed by the unicode character code expressed as four
+hexidecimal digits.
+ at samp{\U} must be followed by the unicode character code expressed as eight
+hexidecimal digits.
+The encoding used for unicode characters is implementation dependent.
+For the Melbourne Mercury compiler it is UTF-8 for the C backends and UTF-16
+for the Java and IL backends.
+
A backslash followed immediately by a newline is deleted; thus an
escaped newline can be used to continue a string over more than one
source line. (String literals may also contain embedded newlines.)
Index: library/lexer.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/lexer.m,v
retrieving revision 1.46
diff -u -r1.46 lexer.m
--- library/lexer.m 19 Apr 2006 05:17:53 -0000 1.46
+++ library/lexer.m 20 Jul 2006 07:13:33 -0000
@@ -895,6 +895,10 @@
get_quoted_name(QuoteChar, Chars1, Token, !IO)
; Char = 'x' ->
get_hex_escape(QuoteChar, Chars, [], Token, !IO)
+ ; Char = 'u' ->
+ get_unicode_escape(4, QuoteChar, Chars, [], Token, !IO)
+ ; Char = 'U' ->
+ get_unicode_escape(8, QuoteChar, Chars, [], Token, !IO)
; char.is_octal_digit(Char) ->
get_octal_escape(QuoteChar, Chars, [Char], Token, !IO)
;
@@ -923,6 +927,12 @@
; Char = 'x' ->
string_get_hex_escape(String, Len, QuoteChar, Chars, [],
Posn0, Token, Context, !Posn)
+ ; Char = 'u' ->
+ string_get_unicode_escape(4, String, Len, QuoteChar, Chars,
+ [], Posn0, Token, Context, !Posn)
+ ; Char = 'U' ->
+ string_get_unicode_escape(8, String, Len, QuoteChar, Chars,
+ [], Posn0, Token, Context, !Posn)
; char.is_octal_digit(Char) ->
string_get_octal_escape(String, Len, QuoteChar, Chars, [Char],
Posn0, Token, Context, !Posn)
@@ -971,6 +981,195 @@
)
).
+:- pred get_unicode_escape(int::in, char::in, list(char)::in, list(char)::in,
+ token::out, io::di, io::uo) is det.
+
+get_unicode_escape(NumHexChars, QuoteChar, Chars, HexChars, Token, !IO) :-
+ ( if NumHexChars = list.length(HexChars) then
+ rev_char_list_to_string(HexChars, HexString),
+ ( if
+ string.base_string_to_int(16, HexString, UnicodeCharCode),
+ convert_unicode_char_to_target_chars(UnicodeCharCode, UTF8Chars)
+ then
+ get_quoted_name(QuoteChar, list.reverse(UTF8Chars) ++ Chars,
+ Token, !IO)
+ else
+ Token = error("invalid unicode character code")
+ )
+ else
+ io.read_char(Result, !IO),
+ (
+ Result = error(Error),
+ Token = io_error(Error)
+ ;
+ Result = eof,
+ Token = error("unterminated quote")
+ ;
+ Result = ok(Char),
+ ( if char.is_hex_digit(Char) then
+ get_unicode_escape(NumHexChars, QuoteChar, Chars,
+ [Char | HexChars], Token, !IO)
+ else
+ Token = error("invalid hex character in unicode escape")
+ )
+ )
+ ).
+
+:- pred string_get_unicode_escape(int::in, string::in, int::in, char::in,
+ list(char)::in, list(char)::in, posn::in, token::out,
+ string_token_context::out, posn::in, posn::out) is det.
+
+string_get_unicode_escape(NumHexChars, String, Len, QuoteChar, Chars,
+ HexChars, Posn0, Token, Context, !Posn) :-
+ ( if NumHexChars = list.length(HexChars) then
+ rev_char_list_to_string(HexChars, HexString),
+ ( if
+ string.base_string_to_int(16, HexString, UnicodeCharCode),
+ convert_unicode_char_to_target_chars(UnicodeCharCode, UTF8Chars)
+ then
+ RevCharsWithUnicode = list.reverse(UTF8Chars) ++ Chars,
+ string_get_quoted_name(String, Len, QuoteChar, RevCharsWithUnicode,
+ Posn0, Token, Context, !Posn)
+ else
+ string_get_context(Posn0, Context, !Posn),
+ Token = error("invalid unicode character code")
+ )
+ else
+ ( if string_read_char(String, Len, Char, !Posn) then
+ ( if char.is_hex_digit(Char) then
+ string_get_unicode_escape(NumHexChars, String, Len, QuoteChar,
+ Chars, [Char | HexChars], Posn0, Token, Context, !Posn)
+ else
+ string_get_context(Posn0, Context, !Posn),
+ Token = error("invalid hex character in unicode escape")
+ )
+ else
+ string_get_context(Posn0, Context, !Posn),
+ Token = error("unterminated quote")
+ )
+ ).
+
+:- pred convert_unicode_char_to_target_chars(int::in, list(char)::out)
+ is semidet.
+
+convert_unicode_char_to_target_chars(UnicodeCharCode, Chars) :-
+ BackendEncoding = backend_unicode_encoding,
+ (
+ BackendEncoding = utf8,
+ encode_unicode_char_as_utf8(UnicodeCharCode, Chars)
+ ;
+ BackendEncoding = utf16,
+ encode_unicode_char_as_utf16(UnicodeCharCode, Chars)
+ ).
+
+:- pred encode_unicode_char_as_utf8(int::in, list(char)::out) is semidet.
+
+encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars) :-
+ allowed_unicode_char_code(UnicodeCharCode),
+ ( if UnicodeCharCode =< 0x00007F then
+ UTF8Chars = [char.det_from_int(UnicodeCharCode)]
+ else if UnicodeCharCode =< 0x0007FF then
+ Part1 = (0b11111000000 /\ UnicodeCharCode) >> 6,
+ Part2 = 0b00000111111 /\ UnicodeCharCode,
+ char.det_from_int(Part1 \/ 0b11000000, UTF8Char1),
+ char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+ UTF8Chars = [UTF8Char1, UTF8Char2]
+ else if UnicodeCharCode =< 0x00FFFF then
+ Part1 = (0b1111000000000000 /\ UnicodeCharCode) >> 12,
+ Part2 = (0b0000111111000000 /\ UnicodeCharCode) >> 6,
+ Part3 = 0b0000000000111111 /\ UnicodeCharCode,
+ char.det_from_int(Part1 \/ 0b11100000, UTF8Char1),
+ char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+ char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
+ UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3]
+ else
+ Part1 = (0b111000000000000000000 /\ UnicodeCharCode) >> 18,
+ Part2 = (0b000111111000000000000 /\ UnicodeCharCode) >> 12,
+ Part3 = (0b000000000111111000000 /\ UnicodeCharCode) >> 6,
+ Part4 = 0b000000000000000111111 /\ UnicodeCharCode,
+ char.det_from_int(Part1 \/ 0b11110000, UTF8Char1),
+ char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
+ char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
+ char.det_from_int(Part4 \/ 0b10000000, UTF8Char4),
+ UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3, UTF8Char4]
+ ).
+
+:- pred encode_unicode_char_as_utf16(int::in, list(char)::out) is semidet.
+
+ % This predicate should only be called on backends that have
+ % a 16 bit character type.
+ %
+encode_unicode_char_as_utf16(UnicodeCharCode, UTF16Chars) :-
+ allowed_unicode_char_code(UnicodeCharCode),
+ ( if UnicodeCharCode =< 0xFFFF then
+ char.det_from_int(UnicodeCharCode, Char),
+ UTF16Chars = [Char]
+ else
+ LeadOffset = 0xDB00 - (0x10000 >> 10),
+ Lead = LeadOffset + (UnicodeCharCode >> 10),
+ Trail = 0xDC00 + (UnicodeCharCode /\ 0x3FF),
+ char.det_from_int(Lead, LeadChar),
+ char.det_from_int(Trail, TrailChar),
+ UTF16Chars = [LeadChar, TrailChar]
+ ).
+
+:- pred allowed_unicode_char_code(int::in) is semidet.
+
+allowed_unicode_char_code(Code) :-
+ Code >= 0,
+ % The following ranges are reserved for utf-16 surrogates.
+ not (
+ Code >= 0xD800, Code =< 0xDBFF
+ ;
+ Code >= 0xDC00, Code =< 0xDFFF
+ ),
+ Code =< 0x10FFFF.
+
+:- type unicode_encoding
+ ---> utf8
+ ; utf16.
+
+:- func backend_unicode_encoding = unicode_encoding.
+
+backend_unicode_encoding = Encoding :-
+ Int = backend_unicode_encoding_int,
+ ( unicode_encoding_int_to_encoding(Int, Encoding0) ->
+ Encoding = Encoding0
+ ;
+ error("backend_unicode_encoding: unexpected unicode encoding code")
+ ).
+
+:- pred unicode_encoding_int_to_encoding(int::in, unicode_encoding::out)
+ is semidet.
+
+unicode_encoding_int_to_encoding(0, utf8).
+unicode_encoding_int_to_encoding(1, utf16).
+
+:- func backend_unicode_encoding_int = int.
+
+:- pragma inline(backend_unicode_encoding_int/0).
+
+:- pragma foreign_proc("C",
+ backend_unicode_encoding_int = (EncodingInt::out),
+ [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
+"
+ EncodingInt = 0;
+").
+
+:- pragma foreign_proc("Java",
+ backend_unicode_encoding_int = (EncodingInt::out),
+ [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
+"
+ EncodingInt = 1;
+").
+
+:- pragma foreign_proc("Java",
+ backend_unicode_encoding_int = (EncodingInt::out),
+ [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
+"
+ EncodingInt = 1;
+").
+
:- pred string_get_hex_escape(string::in, int::in, char::in,
list(char)::in, list(char)::in, posn::in, token::out,
string_token_context::out, posn::in, posn::out) is det.
Index: library/list.m
===================================================================
RCS file: /home/mercury1/repository/mercury/library/list.m,v
retrieving revision 1.150
diff -u -r1.150 list.m
--- library/list.m 15 May 2006 09:12:05 -0000 1.150
+++ library/list.m 4 Jul 2006 13:35:59 -0000
@@ -1021,14 +1021,17 @@
% Same as list.filter_map/3 except that it only returns the first
% match:
% find_first_map(X, Y, Z) <=> list.filter_map(X, Y, [Z | _])
+ %
:- pred list.find_first_map(pred(X, Y)::in(pred(in, out) is semidet),
list(X)::in, Y::out) is semidet.
% Same as list.find_first_map, except with two outputs.
+ %
:- pred list.find_first_map2(pred(X, A, B)::in(pred(in, out, out) is semidet),
list(X)::in, A::out, B::out) is semidet.
% Same as list.find_first_map, except with three outputs.
+ %
:- pred list.find_first_map3(
pred(X, A, B, C)::in(pred(in, out, out, out) is semidet),
list(X)::in, A::out, B::out, C::out) is semidet.
Index: tests/hard_coded/Mmakefile
===================================================================
RCS file: /home/mercury1/repository/tests/hard_coded/Mmakefile,v
retrieving revision 1.289
diff -u -r1.289 Mmakefile
--- tests/hard_coded/Mmakefile 19 Jul 2006 15:19:07 -0000 1.289
+++ tests/hard_coded/Mmakefile 20 Jul 2006 07:58:41 -0000
@@ -214,6 +214,7 @@
type_spec_ho_term \
type_spec_modes \
type_to_term_bug \
+ unicode \
unify_existq_cons \
unify_expression \
unify_typeinfo_bug \
Index: tests/hard_coded/unicode.exp
===================================================================
RCS file: tests/hard_coded/unicode.exp
diff -N tests/hard_coded/unicode.exp
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode.exp 21 Jul 2006 00:50:43 -0000
@@ -0,0 +1,29 @@
+00000011
+00000011
+11001110 10010100
+11001110 10100000
+11101111 10111111 10111111
+11110100 10001111 10111111 10111111
+11110010 10101011 10110011 10011110
+01110010 11000011 10101001 01110011 01110101 01101101 11000011 10101001
+01100001 01100010 01100011 00110001 00110010 00110011
+01011100 01110101 00110000 00110000 00110100 00110001
+01011100 01110101 00110000 00110000 00110100 00110001
+01011100 01000001
+01011100 01110101 00110000 00110000 00110100 00110001
+01110101 00110000 00110000 00110100 00110001
+
+
+
+Δ
+Π
+
+
+
+résumé
+abc123
+\u0041
+\u0041
+\A
+\u0041
+u0041
Index: tests/hard_coded/unicode.m
===================================================================
RCS file: tests/hard_coded/unicode.m
diff -N tests/hard_coded/unicode.m
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/hard_coded/unicode.m 20 Jul 2006 07:56:50 -0000
@@ -0,0 +1,50 @@
+:- module unicode.
+
+:- interface.
+
+:- import_module io.
+
+:- pred main(io::di, io::uo) is det.
+
+:- implementation.
+
+:- import_module char.
+:- import_module list.
+:- import_module string.
+
+main(!IO) :-
+ list.foldl(write_string_as_binary, utf8_strings, !IO),
+ io.nl(!IO),
+ Str = string.join_list("\n", utf8_strings),
+ io.write_string(Str, !IO),
+ io.nl(!IO).
+
+:- func utf8_strings = list(string).
+
+utf8_strings = [
+ "\u0003",
+ "\U00000003",
+ "\u0394", % delta
+ "\u03A0", % pi
+ "\uFFFF",
+ "\U0010ffff",
+ "\U000ABCde",
+ "r\u00E9sum\u00E9", % "resume" with accents
+ "abc123",
+ "\u005cu0041",
+ "\x5c\u0041",
+ "\x5c\\u0041",
+ "\\u0041",
+ "u0041"
+].
+
+:- pred write_string_as_binary(string::in, io::di, io::uo) is det.
+
+write_string_as_binary(Str, !IO) :-
+ Chars = string.to_char_list(Str),
+ Ints = list.map(to_int, Chars),
+ Bins = list.map(( func(X) = int_to_base_string(X, 2) ), Ints),
+ PaddedBins = list.map(( func(S) = pad_left(S, '0', 8) ), Bins),
+ Bin = string.join_list(" ", PaddedBins),
+ io.write_string(Bin, !IO),
+ io.nl(!IO).
Index: tests/invalid/Mmakefile
===================================================================
RCS file: /home/mercury1/repository/tests/invalid/Mmakefile,v
retrieving revision 1.195
diff -u -r1.195 Mmakefile
--- tests/invalid/Mmakefile 12 Jul 2006 02:51:22 -0000 1.195
+++ tests/invalid/Mmakefile 20 Jul 2006 07:58:43 -0000
@@ -199,6 +199,8 @@
undef_symbol \
undef_type \
undef_type_mod_qual \
+ unicode1 \
+ unicode2 \
unify_mode_error \
uniq_modes \
uniq_neg \
Index: tests/invalid/unicode1.exp
===================================================================
RCS file: tests/invalid/unicode1.exp
diff -N tests/invalid/unicode1.exp
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode1.exp 4 Jul 2006 10:07:16 -0000
@@ -0,0 +1,5 @@
+unicode1.m:005: Syntax error: invalid hex character in unicode escape.
+unicode1.m:005: Syntax error: unterminated quote.
+unicode1.m:001: Warning: interface for module `unicode1' does not export
+unicode1.m:001: anything.
+For more information, recompile with `-E'.
Index: tests/invalid/unicode1.m
===================================================================
RCS file: tests/invalid/unicode1.m
diff -N tests/invalid/unicode1.m
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode1.m 4 Jul 2006 10:07:06 -0000
@@ -0,0 +1,5 @@
+:- module unicode1.
+
+:- interface.
+
+:- type '\uabxy' == int.
Index: tests/invalid/unicode2.exp
===================================================================
RCS file: tests/invalid/unicode2.exp
diff -N tests/invalid/unicode2.exp
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode2.exp 4 Jul 2006 12:30:52 -0000
@@ -0,0 +1,5 @@
+unicode2.m:005: Syntax error: invalid unicode character code.
+unicode2.m:005: Syntax error: unterminated quote.
+unicode2.m:001: Warning: interface for module `unicode2' does not export
+unicode2.m:001: anything.
+For more information, recompile with `-E'.
Index: tests/invalid/unicode2.m
===================================================================
RCS file: tests/invalid/unicode2.m
diff -N tests/invalid/unicode2.m
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ tests/invalid/unicode2.m 4 Jul 2006 12:27:46 -0000
@@ -0,0 +1,5 @@
+:- module unicode2.
+
+:- interface.
+
+:- type '\UFFFFFFFF' == int.
--------------------------------------------------------------------------
mercury-reviews mailing list
post: mercury-reviews at csse.unimelb.edu.au
administrative address: owner-mercury-reviews at csse.unimelb.edu.au
unsubscribe: Address: mercury-reviews-request at csse.unimelb.edu.au Message: unsubscribe
subscribe: Address: mercury-reviews-request at csse.unimelb.edu.au Message: subscribe
--------------------------------------------------------------------------
More information about the reviews
mailing list