[m-rev.] for review: add some unicode support to Mercury
Ian MacLarty
maclarty at csse.unimelb.edu.au
Thu Jul 27 10:49:00 AEST 2006
On Fri, Jul 21, 2006 at 04:34:20PM +1000, Julien Fischer wrote:
>
> On Fri, 21 Jul 2006, Ian MacLarty wrote:
>
> >Estimated hours taken: 6
> >Branches: main
> >
> >Add escape sequences for encoding unicode characters in Mercury string
> >literals. The unicode characters are encoded in UTF-8 on the C backends and
> >in UTF-16 on the Java and IL backends.
> >
> >NEWS:
> > Mention the changes.
> >
> >doc/reference_manual.texi:
> > Document the new escape sequences.
> >
> > Divide the section on string tokens up into multiple paragraphs.
> >
> >library/lexer.m:
> > Convert unicode characters encoded with the new escape sequences to
> > UTF-8 on the C backends and UTF-16 on the Java and IL backends.
> >
> > Some of the new predicate may be moved to a "unicode" module that I'm
>
> s/predicate/predicates/
>
Fixed.
> > intending to add to the standard library in a subsequent change.
> >
> >library/list.m:
> > Fix the formatting of some comments.
> >
> >tests/hard_coded/Mmakefile:
> >tests/hard_coded/unicode.exp:
> >tests/hard_coded/unicode.m:
> >tests/invalid/Mmakefile:
> >tests/invalid/unicode1.exp:
> >tests/invalid/unicode1.m:
> >tests/invalid/unicode2.exp:
> >tests/invalid/unicode2.m:
> > Test the new support for unicode.
> >
> >Index: NEWS
> >===================================================================
> >RCS file: /home/mercury1/repository/mercury/NEWS,v
> >retrieving revision 1.415
> >diff -u -r1.415 NEWS
> >--- NEWS 20 Jul 2006 01:50:41 -0000 1.415
> >+++ NEWS 20 Jul 2006 08:21:32 -0000
> >@@ -32,6 +32,8 @@
> >* ':' is now the type qualification operator, not a module qualifier.
> >* To ensure soundness, goals in negated contexts using non-local variables
> > with dynamic modes (inst "any") must now be marked as impure.
> >+* Unicode characters can now be encoded in string literals using an
> >+ escape sequence.
> >
> >Changes to the Mercury standard library:
> >* We have removed the predicates dealing with runtime type information (RTTI)
> >@@ -292,6 +294,13 @@
> > sure that the goal is in fact pure, e.g. because they know that
> > the goal inside the negation will not instantiate the variable.
> >
> >+* Unicode characters can now be encoded in string literals using an
> >+ escape sequence. The escape sequence \uXXXX (or \UXXXXXXXX), where XXXX
> >+ (or XXXXXXXX) is a unicode character code in hexadecimal, is replaced with
> >+ the corresponding unicode character. The encoding used to represent the
> >+ unicode character is implementation dependent. For the Melbourne Mercury
> >+ compiler unicode characters are represented using UTF-8 for the C backends.
> >+
> >Changes to the Mercury standard library:
> >
> >* We have added the function `divide_equivalence_classes' to the `eqvclass'
> >Index: doc/reference_manual.texi
> >===================================================================
> >RCS file: /home/mercury1/repository/mercury/doc/reference_manual.texi,v
> >retrieving revision 1.361
> >diff -u -r1.361 reference_manual.texi
> >--- doc/reference_manual.texi 20 Jul 2006 02:44:04 -0000 1.361
> >+++ doc/reference_manual.texi 20 Jul 2006 07:58:33 -0000
> >@@ -220,21 +220,35 @@
> >
> >@item string
> >A string is a sequence of characters enclosed in double quotes (@code{"}).
> >+
> >Within a string, two adjacent double quotes stand for a single double quote.
> >For example, the string @samp{ """" } is a string of length one, containing
> >a single double quote: the outermost pair of double quotes encloses the
> >string, and the innermost pair stand for a single double quote.
> >+
> >Strings may also contain backslash escapes. @samp{\a} stands for ``alert''
> >(a beep character), @samp{\b} for backspace, @samp{\r} for carriage-return,
> >@samp{\f} for form-feed, @samp{\t} for tab, @samp{\n} for newline,
> >@samp{\v} for vertical-tab. An escaped backslash, single-quote, or
> >-double-quote stands for itself. The sequence @samp{\x} introduces
> >+double-quote stands for itself.
> >+
> >+The sequence @samp{\x} introduces
> >a hexadecimal escape; it must be followed by a sequence of hexadecimal
> >digits and then a closing backslash. It is replaced
> >with the character whose character code is identified by the hexadecimal
> >number. Similarly, a backslash followed by an octal digit is the
> >beginning of an octal escape; as with hexadecimal escapes, the sequence
> >of octal digits must be terminated with a closing backslash.
> >+
> >+The sequence @samp{\u} or @samp{\U} can be used to escape unicode characters.
> >+@samp{\u} must be followed by the unicode character code expressed as four
> >+hexadecimal digits.
> >+@samp{\U} must be followed by the unicode character code expressed as eight
> >+hexadecimal digits.
> >+The encoding used for unicode characters is implementation dependent.
> >+For the Melbourne Mercury compiler it is UTF-8 for the C backends and UTF-16
> >+for the Java and IL backends.
>
> ...
>
> >Index: library/lexer.m
> >===================================================================
> >RCS file: /home/mercury1/repository/mercury/library/lexer.m,v
> >retrieving revision 1.46
> >diff -u -r1.46 lexer.m
> >--- library/lexer.m 19 Apr 2006 05:17:53 -0000 1.46
> >+++ library/lexer.m 20 Jul 2006 07:13:33 -0000
> >+:- pred encode_unicode_char_as_utf8(int::in, list(char)::out) is semidet.
> >+
> >+encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars) :-
> >+ allowed_unicode_char_code(UnicodeCharCode),
> >+ ( if UnicodeCharCode =< 0x00007F then
> >+ UTF8Chars = [char.det_from_int(UnicodeCharCode)]
> >+ else if UnicodeCharCode =< 0x0007FF then
> >+ Part1 = (0b11111000000 /\ UnicodeCharCode) >> 6,
> >+ Part2 = 0b00000111111 /\ UnicodeCharCode,
> >+ char.det_from_int(Part1 \/ 0b11000000, UTF8Char1),
> >+ char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> >+ UTF8Chars = [UTF8Char1, UTF8Char2]
> >+ else if UnicodeCharCode =< 0x00FFFF then
> >+ Part1 = (0b1111000000000000 /\ UnicodeCharCode) >> 12,
> >+ Part2 = (0b0000111111000000 /\ UnicodeCharCode) >> 6,
> >+ Part3 = 0b0000000000111111 /\ UnicodeCharCode,
> >+ char.det_from_int(Part1 \/ 0b11100000, UTF8Char1),
> >+ char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> >+ char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
> >+ UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3]
> >+ else
> >+ Part1 = (0b111000000000000000000 /\ UnicodeCharCode) >> 18,
> >+ Part2 = (0b000111111000000000000 /\ UnicodeCharCode) >> 12,
> >+ Part3 = (0b000000000111111000000 /\ UnicodeCharCode) >> 6,
> >+ Part4 = 0b000000000000000111111 /\ UnicodeCharCode,
> >+ char.det_from_int(Part1 \/ 0b11110000, UTF8Char1),
> >+ char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> >+ char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
> >+ char.det_from_int(Part4 \/ 0b10000000, UTF8Char4),
> >+ UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3, UTF8Char4]
> >+ ).
>
> Please document the algorithm used by the above code.
>
I've put a reference to the appropriate section of the Unicode standard.
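
For anyone reading along without the standard to hand, the bit layout can
be sketched as below. This is only an illustration, not the code in the
patch: the module name utf8_sketch and the function utf8_bytes are made up
for the example, it returns plain ints rather than chars, and it assumes
the code point has already been validated (the patch does that with
allowed_unicode_char_code).

```mercury
:- module utf8_sketch.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module int.
:- import_module list.

    % utf8_bytes(Code) returns the UTF-8 byte values for a code point.
    % Hypothetical helper: assumes 0 =< Code =< 0x10FFFF and that Code
    % is not a surrogate; no validation is done here.
    %
:- func utf8_bytes(int) = list(int).

utf8_bytes(Code) = Bytes :-
    ( if Code =< 0x7F then
        % 0xxxxxxx
        Bytes = [Code]
    else if Code =< 0x7FF then
        % 110xxxxx 10xxxxxx
        Bytes = [0xC0 \/ (Code >> 6),
            0x80 \/ (Code /\ 0x3F)]
    else if Code =< 0xFFFF then
        % 1110xxxx 10xxxxxx 10xxxxxx
        Bytes = [0xE0 \/ (Code >> 12),
            0x80 \/ ((Code >> 6) /\ 0x3F),
            0x80 \/ (Code /\ 0x3F)]
    else
        % 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        Bytes = [0xF0 \/ (Code >> 18),
            0x80 \/ ((Code >> 12) /\ 0x3F),
            0x80 \/ ((Code >> 6) /\ 0x3F),
            0x80 \/ (Code /\ 0x3F)]
    ).

main(!IO) :-
    % U+20AC (EURO SIGN) encodes as 0xE2 0x82 0xAC,
    % so this prints [226, 130, 172].
    io.print(utf8_bytes(0x20AC), !IO),
    io.nl(!IO).
```

Each branch keeps the top bits of the code point in the lead byte and puts
six bits in every continuation byte, which is the same split the masks and
shifts in the predicate perform.
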
> >+:- pred encode_unicode_char_as_utf16(int::in, list(char)::out) is semidet.
> >+
> >+ % This predicate should only be called on backends that have
> >+ % a 16 bit character type.
> >+ %
> >+encode_unicode_char_as_utf16(UnicodeCharCode, UTF16Chars) :-
> >+ allowed_unicode_char_code(UnicodeCharCode),
> >+ ( if UnicodeCharCode =< 0xFFFF then
> >+ char.det_from_int(UnicodeCharCode, Char),
> >+ UTF16Chars = [Char]
> >+ else
> >+ LeadOffset = 0xD800 - (0x10000 >> 10),
> >+ Lead = LeadOffset + (UnicodeCharCode >> 10),
> >+ Trail = 0xDC00 + (UnicodeCharCode /\ 0x3FF),
> >+ char.det_from_int(Lead, LeadChar),
> >+ char.det_from_int(Trail, TrailChar),
> >+ UTF16Chars = [LeadChar, TrailChar]
> >+ ).
> >+
>
> And this one.
>
Same here.
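
Again purely as an illustration, the surrogate pair calculation can be
sketched like this. The module name utf16_sketch and the function
utf16_units are hypothetical, the result is a list of ints rather than
chars, and the code point is assumed to be valid. Lead surrogates start at
0xD800 and trail surrogates at 0xDC00; subtracting 0x10000 first is
arithmetically the same as folding the constant 0xD800 - (0x10000 >> 10)
into the lead offset.

```mercury
:- module utf16_sketch.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module int.
:- import_module list.

    % utf16_units(Code) returns the UTF-16 code units for a code point.
    % Hypothetical helper: assumes 0 =< Code =< 0x10FFFF and that Code
    % is not itself a surrogate; no validation is done here.
    %
:- func utf16_units(int) = list(int).

utf16_units(Code) = Units :-
    ( if Code =< 0xFFFF then
        % Code points in the Basic Multilingual Plane are one unit.
        Units = [Code]
    else
        % Spread the 20 bits of (Code - 0x10000) over two units:
        % the high 10 bits go in the lead surrogate, the low 10 bits
        % in the trail surrogate.
        Offset = Code - 0x10000,
        Lead = 0xD800 + (Offset >> 10),
        Trail = 0xDC00 + (Offset /\ 0x3FF),
        Units = [Lead, Trail]
    ).

main(!IO) :-
    % U+1D11E (MUSICAL SYMBOL G CLEF) encodes as 0xD834 0xDD1E,
    % so this prints [55348, 56606].
    io.print(utf16_units(0x1D11E), !IO),
    io.nl(!IO).
```
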
> ...
>
> >Index: tests/hard_coded/Mmakefile
> >===================================================================
> >RCS file: /home/mercury1/repository/tests/hard_coded/Mmakefile,v
> >retrieving revision 1.289
> >diff -u -r1.289 Mmakefile
> >--- tests/hard_coded/Mmakefile 19 Jul 2006 15:19:07 -0000 1.289
> >+++ tests/hard_coded/Mmakefile 20 Jul 2006 07:58:41 -0000
> >@@ -214,6 +214,7 @@
> > type_spec_ho_term \
> > type_spec_modes \
> > type_to_term_bug \
> >+ unicode \
>
> That's a poor choice of module name for that test case if you are
> planning to add a unicode module to the standard library.
> (Unless you're also planning to put the stdlib in the std namespace as
> well ;-) )
>
I've changed it to "unicode_test".
Ian.