[m-rev.] for review: add some unicode support to Mercury

Ian MacLarty maclarty at csse.unimelb.edu.au
Thu Jul 27 10:49:00 AEST 2006


On Fri, Jul 21, 2006 at 04:34:20PM +1000, Julien Fischer wrote:
> 
> On Fri, 21 Jul 2006, Ian MacLarty wrote:
> 
> >Estimated hours taken: 6
> >Branches: main
> >
> >Add escape sequences for encoding unicode characters in Mercury string
> >literals.  The unicode characters are encoded in UTF-8 on the C backends
> >and in UTF-16 on the Java and IL backends.
> >
> >NEWS:
> >	Mention the changes.
> >
> >doc/reference_manual.texi:
> >	Document the new escape sequences.
> >
> >	Divide the section on string tokens up into multiple paragraphs.
> >
> >library/lexer.m:
> >	Convert unicode characters encoded with the new escape sequences to
> >	UTF-8 on the C backends and UTF-16 on the Java and IL backends.
> >
> >	Some of the new predicate may be moved to a "unicode" module that I'm
> 
> s/predicate/predicates/
> 

Fixed.

> >	intending to add to the standard library in a subsequent change.
> >
> >library/list.m:
> >	Fix the formatting of some comments.
> >
> >tests/hard_coded/Mmakefile:
> >tests/hard_coded/unicode.exp:
> >tests/hard_coded/unicode.m:
> >tests/invalid/Mmakefile:
> >tests/invalid/unicode1.exp:
> >tests/invalid/unicode1.m:
> >tests/invalid/unicode2.exp:
> >tests/invalid/unicode2.m:
> >	Test the new support for unicode.
> >
> >Index: NEWS
> >===================================================================
> >RCS file: /home/mercury1/repository/mercury/NEWS,v
> >retrieving revision 1.415
> >diff -u -r1.415 NEWS
> >--- NEWS	20 Jul 2006 01:50:41 -0000	1.415
> >+++ NEWS	20 Jul 2006 08:21:32 -0000
> >@@ -32,6 +32,8 @@
> >* ':' is now the type qualification operator, not a module qualifier.
> >* To ensure soundness, goals in negated contexts using non-local variables
> >  with dynamic modes (inst "any") must now be marked as impure.
> >+* Unicode characters can now be encoded in string literals using an
> >+  escape sequence.
> >
> >Changes to the Mercury standard library:
> >* We have removed the predicates dealing with runtime type information (RTTI)
> >@@ -292,6 +294,13 @@
> >  sure that the goal is in fact pure, e.g. because they know that
> >  the goal inside the negation will not instantiate the variable.
> >
> >+* Unicode characters can now be encoded in string literals using an
> >+  escape sequence.  The escape sequence \uXXXX (or \UXXXXXXXX), where XXXX
> >+  (or XXXXXXXX) is a unicode character code in hexadecimal, is replaced with
> >+  the corresponding unicode character.  The encoding used to represent the
> >+  unicode character is implementation dependent.  For the Melbourne Mercury
> >+  compiler unicode characters are represented using UTF-8 for the C backends.
> >+
> >Changes to the Mercury standard library:
> >
> >* We have added the function `divide_equivalence_classes' to the `eqvclass'
> >Index: doc/reference_manual.texi
> >===================================================================
> >RCS file: /home/mercury1/repository/mercury/doc/reference_manual.texi,v
> >retrieving revision 1.361
> >diff -u -r1.361 reference_manual.texi
> >--- doc/reference_manual.texi	20 Jul 2006 02:44:04 -0000	1.361
> >+++ doc/reference_manual.texi	20 Jul 2006 07:58:33 -0000
> >@@ -220,21 +220,35 @@
> >
> >@item string
> >A string is a sequence of characters enclosed in double quotes (@code{"}).
> >+
> >Within a string, two adjacent double quotes stand for a single double quote.
> >For example, the string @samp{ """" } is a string of length one, containing
> >a single double quote: the outermost pair of double quotes encloses the
> >string, and the innermost pair stand for a single double quote.
> >+
> >Strings may also contain backslash escapes.  @samp{\a} stands for ``alert''
> >(a beep character), @samp{\b} for backspace, @samp{\r} for carriage-return,
> >@samp{\f} for form-feed, @samp{\t} for tab, @samp{\n} for newline,
> >@samp{\v} for vertical-tab.  An escaped backslash, single-quote, or
> >-double-quote stands for itself.  The sequence @samp{\x} introduces
> >+double-quote stands for itself.
> >+
> >+The sequence @samp{\x} introduces
> >a hexadecimal escape; it must be followed by a sequence of hexadecimal
> >digits and then a closing backslash.  It is replaced
> >with the character whose character code is identified by the hexadecimal
> >number.  Similarly, a backslash followed by an octal digit is the
> >beginning of an octal escape; as with hexadecimal escapes, the sequence
> >of octal digits must be terminated with a closing backslash.
> >+
> >+The sequence @samp{\u} or @samp{\U} can be used to escape unicode characters.
> >+@samp{\u} must be followed by the unicode character code expressed as four
> >+hexadecimal digits.
> >+@samp{\U} must be followed by the unicode character code expressed as eight
> >+hexadecimal digits.
> >+The encoding used for unicode characters is implementation dependent.
> >+For the Melbourne Mercury compiler it is UTF-8 for the C backends and UTF-16
> >+for the Java and IL backends.
> 
> ...
> 
> >Index: library/lexer.m
> >===================================================================
> >RCS file: /home/mercury1/repository/mercury/library/lexer.m,v
> >retrieving revision 1.46
> >diff -u -r1.46 lexer.m
> >--- library/lexer.m	19 Apr 2006 05:17:53 -0000	1.46
> >+++ library/lexer.m	20 Jul 2006 07:13:33 -0000
> >+:- pred encode_unicode_char_as_utf8(int::in, list(char)::out) is semidet.
> >+
> >+encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars) :-
> >+    allowed_unicode_char_code(UnicodeCharCode),
> >+    ( if UnicodeCharCode =< 0x00007F then
> >+        UTF8Chars = [char.det_from_int(UnicodeCharCode)]
> >+    else if UnicodeCharCode =< 0x0007FF then
> >+        Part1 = (0b11111000000 /\ UnicodeCharCode) >> 6,
> >+        Part2 =  0b00000111111 /\ UnicodeCharCode,
> >+        char.det_from_int(Part1 \/ 0b11000000, UTF8Char1),
> >+        char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> >+        UTF8Chars = [UTF8Char1, UTF8Char2]
> >+    else if UnicodeCharCode =< 0x00FFFF then
> >+        Part1 = (0b1111000000000000 /\ UnicodeCharCode) >> 12,
> >+        Part2 = (0b0000111111000000 /\ UnicodeCharCode) >> 6,
> >+        Part3 =  0b0000000000111111 /\ UnicodeCharCode,
> >+        char.det_from_int(Part1 \/ 0b11100000, UTF8Char1),
> >+        char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> >+        char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
> >+        UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3]
> >+    else
> >+        Part1 = (0b111000000000000000000 /\ UnicodeCharCode) >> 18,
> >+        Part2 = (0b000111111000000000000 /\ UnicodeCharCode) >> 12,
> >+        Part3 = (0b000000000111111000000 /\ UnicodeCharCode) >> 6,
> >+        Part4 =  0b000000000000000111111 /\ UnicodeCharCode,
> >+        char.det_from_int(Part1 \/ 0b11110000, UTF8Char1),
> >+        char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> >+        char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
> >+        char.det_from_int(Part4 \/ 0b10000000, UTF8Char4),
> >+        UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3, UTF8Char4]
> >+    ).
> 
> Please document the algorithm used by the above code.
> 

I've put a reference to the appropriate section of the Unicode standard.
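
For readers without the standard to hand, the bit layout that the UTF-8
encoding form prescribes can be sketched as below. This is an illustrative
Python sketch, not the code from the patch: the function name encode_utf8 is
hypothetical, and the validity check (code points up to 0x10FFFF, excluding
the surrogate range) is an assumption about what allowed_unicode_char_code
accepts, since that predicate is not shown in this message.

```python
def encode_utf8(code_point):
    """Return the UTF-8 encoding of a code point as a list of byte values."""
    # Assumed validity rule: a Unicode scalar value, i.e. in range and
    # not a surrogate. The patch's own check is not shown in the thread.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % code_point)

    if code_point <= 0x7F:
        # 1 byte:  0xxxxxxx
        return [code_point]
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]


if __name__ == "__main__":
    # U+20AC (euro sign) needs three bytes.
    print([hex(b) for b in encode_utf8(0x20AC)])
```

Each branch peels off six payload bits per continuation byte and puts the
remaining high bits in the lead byte, which is the same masking and shifting
that the quoted predicate does with binary literals.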

> >+:- pred encode_unicode_char_as_utf16(int::in, list(char)::out) is semidet.
> >+
> >+    % This predicate should only be called on backends that have
> >+    % a 16 bit character type.
> >+    %
> >+encode_unicode_char_as_utf16(UnicodeCharCode, UTF16Chars) :-
> >+    allowed_unicode_char_code(UnicodeCharCode),
> >+    ( if UnicodeCharCode =< 0xFFFF then
> >+        char.det_from_int(UnicodeCharCode, Char),
> >+        UTF16Chars = [Char]
> >+    else
> >+        LeadOffset = 0xD800 - (0x10000 >> 10),
> >+        Lead = LeadOffset + (UnicodeCharCode >> 10),
> >+        Trail = 0xDC00 + (UnicodeCharCode /\ 0x3FF),
> >+        char.det_from_int(Lead, LeadChar),
> >+        char.det_from_int(Trail, TrailChar),
> >+        UTF16Chars = [LeadChar, TrailChar]
> >+    ).
> >+
> 
> And this one.
> 

Same here.
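
The surrogate-pair arithmetic for UTF-16 can be sketched in the same way.
Again this is an illustrative Python sketch rather than the patch itself:
encode_utf16 is a hypothetical name, the validity check is an assumption, and
the lead offset 0xD800 - (0x10000 >> 10) is the constant given in the Unicode
FAQ on UTF-16.

```python
def encode_utf16(code_point):
    """Return the UTF-16 encoding of a code point as a list of 16-bit units."""
    # Assumed validity rule: a Unicode scalar value (in range, not a surrogate).
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % code_point)

    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: one code unit, the code point itself.
        return [code_point]

    # Supplementary planes: split into a lead and a trail surrogate.
    # Folding the subtraction of 0x10000 into the offset avoids a
    # separate step: lead = 0xD800 + ((code_point - 0x10000) >> 10).
    lead_offset = 0xD800 - (0x10000 >> 10)
    lead = lead_offset + (code_point >> 10)
    trail = 0xDC00 + (code_point & 0x3FF)
    return [lead, trail]


if __name__ == "__main__":
    # U+1F600 lies outside the BMP, so it needs a surrogate pair.
    print([hex(u) for u in encode_utf16(0x1F600)])
```

The lead surrogate always falls in 0xD800..0xDBFF and the trail surrogate in
0xDC00..0xDFFF, so a decoder can tell the two halves apart from their values
alone.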

> ...
> 
> >Index: tests/hard_coded/Mmakefile
> >===================================================================
> >RCS file: /home/mercury1/repository/tests/hard_coded/Mmakefile,v
> >retrieving revision 1.289
> >diff -u -r1.289 Mmakefile
> >--- tests/hard_coded/Mmakefile	19 Jul 2006 15:19:07 -0000	1.289
> >+++ tests/hard_coded/Mmakefile	20 Jul 2006 07:58:41 -0000
> >@@ -214,6 +214,7 @@
> >	type_spec_ho_term \
> >	type_spec_modes \
> >	type_to_term_bug \
> >+	unicode \
> 
> That's a poor choice of module name for that test case if you are
> planning to add a unicode module to the standard library.
> (Unless you're also planning to put the stdlib in the std namespace as
> well ;-) )
> 

I've changed it to "unicode_test".

Ian.
--------------------------------------------------------------------------
mercury-reviews mailing list
Post messages to:       mercury-reviews at csse.unimelb.edu.au
Administrative Queries: owner-mercury-reviews at csse.unimelb.edu.au
Subscriptions:          mercury-reviews-request at csse.unimelb.edu.au
--------------------------------------------------------------------------


