[m-rev.] for review: add some unicode support to Mercury

Julien Fischer juliensf at csse.unimelb.edu.au
Fri Jul 21 16:34:20 AEST 2006


On Fri, 21 Jul 2006, Ian MacLarty wrote:

> Estimated hours taken: 6
> Branches: main
>
> Add escape sequences for encoding unicode characters in Mercury string
> literals.  The unicode characters are encoded in UTF-8 on the C backends and
> in UTF-16 on the Java and IL backends.
>
> NEWS:
> 	Mention the changes.
>
> doc/reference_manual.texi:
> 	Document the new escape sequences.
>
> 	Divide the section on string tokens up into multiple paragraphs.
>
> library/lexer.m:
> 	Convert unicode characters encoded with the new escape sequences to
> 	UTF-8 on the C backends and UTF-16 on the Java and IL backends.
>
> 	Some of the new predicate may be moved to a "unicode" module that I'm

s/predicate/predicates/

> 	intending to add to the standard library in a subsequent change.
>
> library/list.m:
> 	Fix the formatting of some comments.
>
> tests/hard_coded/Mmakefile:
> tests/hard_coded/unicode.exp:
> tests/hard_coded/unicode.m:
> tests/invalid/Mmakefile:
> tests/invalid/unicode1.exp:
> tests/invalid/unicode1.m:
> tests/invalid/unicode2.exp:
> tests/invalid/unicode2.m:
> 	Test the new support for unicode.
>
> Index: NEWS
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/NEWS,v
> retrieving revision 1.415
> diff -u -r1.415 NEWS
> --- NEWS	20 Jul 2006 01:50:41 -0000	1.415
> +++ NEWS	20 Jul 2006 08:21:32 -0000
> @@ -32,6 +32,8 @@
> * ':' is now the type qualification operator, not a module qualifier.
> * To ensure soundness, goals in negated contexts using non-local variables
>   with dynamic modes (inst "any") must now be marked as impure.
> +* Unicode characters can now be encoded in string literals using an
> +  escape sequence.
>
> Changes to the Mercury standard library:
> * We have removed the predicates dealing with runtime type information (RTTI)
> @@ -292,6 +294,13 @@
>   sure that the goal is in fact pure, e.g. because they know that
>   the goal inside the negation will not instantiate the variable.
>
> +* Unicode characters can now be encoded in string literals using an
> +  escape sequence.  The escape sequence \uXXXX (or \UXXXXXXXX), where XXXX
> +  (or XXXXXXXX) is a unicode character code in hexadecimal, is replaced with
> +  the corresponding unicode character.  The encoding used to represent the
> +  unicode character is implementation dependent.  For the Melbourne Mercury
> +  compiler unicode characters are represented using UTF-8 for the C backends.
> +
> Changes to the Mercury standard library:
>
> * We have added the function `divide_equivalence_classes' to the `eqvclass'
> Index: doc/reference_manual.texi
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/doc/reference_manual.texi,v
> retrieving revision 1.361
> diff -u -r1.361 reference_manual.texi
> --- doc/reference_manual.texi	20 Jul 2006 02:44:04 -0000	1.361
> +++ doc/reference_manual.texi	20 Jul 2006 07:58:33 -0000
> @@ -220,21 +220,35 @@
>
> @item string
> A string is a sequence of characters enclosed in double quotes (@code{"}).
> +
> Within a string, two adjacent double quotes stand for a single double quote.
> For example, the string @samp{ """" } is a string of length one, containing
> a single double quote: the outermost pair of double quotes encloses the
> string, and the innermost pair stand for a single double quote.
> +
> Strings may also contain backslash escapes.  @samp{\a} stands for ``alert''
> (a beep character), @samp{\b} for backspace, @samp{\r} for carriage-return,
> @samp{\f} for form-feed, @samp{\t} for tab, @samp{\n} for newline,
> @samp{\v} for vertical-tab.  An escaped backslash, single-quote, or
> -double-quote stands for itself.  The sequence @samp{\x} introduces
> +double-quote stands for itself.
> +
> +The sequence @samp{\x} introduces
> a hexadecimal escape; it must be followed by a sequence of hexadecimal
> digits and then a closing backslash.  It is replaced
> with the character whose character code is identified by the hexadecimal
> number.  Similarly, a backslash followed by an octal digit is the
> beginning of an octal escape; as with hexadecimal escapes, the sequence
> of octal digits must be terminated with a closing backslash.
> +
> +The sequence @samp{\u} or @samp{\U} can be used to escape unicode characters.
> +@samp{\u} must be followed by the unicode character code expressed as four
> +hexadecimal digits.
> +@samp{\U} must be followed by the unicode character code expressed as eight
> +hexadecimal digits.
> +The encoding used for unicode characters is implementation dependent.
> +For the Melbourne Mercury compiler it is UTF-8 for the C backends and UTF-16
> +for the Java and IL backends.

...
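
For readers following along, the escape syntax documented above can be
illustrated with a minimal Python sketch. It is not the lexer code from the
patch: the name expand_unicode_escapes is hypothetical, it handles only the
\u and \U forms (plus skipping an escaped backslash), and it assumes that
surrogates and values above 0x10FFFF should be rejected. Malformed
sequences are left untouched here, whereas a real lexer would report an
error.

```python
import re

# Matches an escaped backslash (left alone), \uXXXX, or \UXXXXXXXX.
_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_unicode_escapes(text):
    """Replace \\uXXXX and \\UXXXXXXXX escapes in text with the character."""

    def replace(match):
        digits = match.group(1) or match.group(2)
        if digits is None:
            # An escaped backslash: not a unicode escape, keep it as written.
            return match.group(0)
        code_point = int(digits, 16)
        # Assumption: surrogates and out-of-range values are not allowed.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("invalid unicode character code: %X" % code_point)
        return chr(code_point)

    return _ESCAPE.sub(replace, text)
```

For example, expand_unicode_escapes(r"caf\u00E9") yields the five-character
string "café".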

> Index: library/lexer.m
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/library/lexer.m,v
> retrieving revision 1.46
> diff -u -r1.46 lexer.m
> --- library/lexer.m	19 Apr 2006 05:17:53 -0000	1.46
> +++ library/lexer.m	20 Jul 2006 07:13:33 -0000
> +:- pred encode_unicode_char_as_utf8(int::in, list(char)::out) is semidet.
> +
> +encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars) :-
> +    allowed_unicode_char_code(UnicodeCharCode),
> +    ( if UnicodeCharCode =< 0x00007F then
> +        UTF8Chars = [char.det_from_int(UnicodeCharCode)]
> +    else if UnicodeCharCode =< 0x0007FF then
> +        Part1 = (0b11111000000 /\ UnicodeCharCode) >> 6,
> +        Part2 =  0b00000111111 /\ UnicodeCharCode,
> +        char.det_from_int(Part1 \/ 0b11000000, UTF8Char1),
> +        char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> +        UTF8Chars = [UTF8Char1, UTF8Char2]
> +    else if UnicodeCharCode =< 0x00FFFF then
> +        Part1 = (0b1111000000000000 /\ UnicodeCharCode) >> 12,
> +        Part2 = (0b0000111111000000 /\ UnicodeCharCode) >> 6,
> +        Part3 =  0b0000000000111111 /\ UnicodeCharCode,
> +        char.det_from_int(Part1 \/ 0b11100000, UTF8Char1),
> +        char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> +        char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
> +        UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3]
> +    else
> +        Part1 = (0b111000000000000000000 /\ UnicodeCharCode) >> 18,
> +        Part2 = (0b000111111000000000000 /\ UnicodeCharCode) >> 12,
> +        Part3 = (0b000000000111111000000 /\ UnicodeCharCode) >> 6,
> +        Part4 =  0b000000000000000111111 /\ UnicodeCharCode,
> +        char.det_from_int(Part1 \/ 0b11110000, UTF8Char1),
> +        char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> +        char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
> +        char.det_from_int(Part4 \/ 0b10000000, UTF8Char4),
> +        UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3, UTF8Char4]
> +    ).

Please document the algorithm used by the above code.
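
As a rough illustration of what such documentation could cover, here is a
minimal Python sketch of the standard UTF-8 encoding scheme. It is not the
Mercury code from the patch: encode_utf8 is a hypothetical name, it returns
byte values as ints rather than chars, and the validity check is an
assumption about what allowed_unicode_char_code does. The code point's bits
are split into groups of six from the low end; each continuation byte
carries six bits under the prefix 10, and the first byte carries the
remaining high bits under a prefix that encodes the sequence length.

```python
def encode_utf8(code_point):
    """Return the UTF-8 encoding of a code point as a list of byte values."""
    # Assumption: surrogates and values above 0x10FFFF are rejected.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("invalid unicode character code: %X" % code_point)
    if code_point <= 0x7F:
        # 1 byte:  0xxxxxxx
        return [code_point]
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]
```

For example, encode_utf8(0x20AC) (the euro sign) returns
[0xE2, 0x82, 0xAC].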

> +:- pred encode_unicode_char_as_utf16(int::in, list(char)::out) is semidet.
> +
> +    % This predicate should only be called on backends that have
> +    % a 16 bit character type.
> +    %
> +encode_unicode_char_as_utf16(UnicodeCharCode, UTF16Chars) :-
> +    allowed_unicode_char_code(UnicodeCharCode),
> +    ( if UnicodeCharCode =< 0xFFFF then
> +        char.det_from_int(UnicodeCharCode, Char),
> +        UTF16Chars = [Char]
> +    else
> +        LeadOffset = 0xD800 - (0x10000 >> 10),
> +        Lead = LeadOffset + (UnicodeCharCode >> 10),
> +        Trail = 0xDC00 + (UnicodeCharCode /\ 0x3FF),
> +        char.det_from_int(Lead, LeadChar),
> +        char.det_from_int(Trail, TrailChar),
> +        UTF16Chars = [LeadChar, TrailChar]
> +    ).
> +

And this one.
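
A comparable sketch for UTF-16, again in Python and again only an
illustration under stated assumptions (encode_utf16 is a hypothetical name,
and the validity check stands in for allowed_unicode_char_code). Code points
up to 0xFFFF are a single 16-bit unit. Anything larger has 0x10000
subtracted, leaving a 20-bit value; its high ten bits are added to 0xD800 to
form the lead surrogate and its low ten bits are added to 0xDC00 to form the
trail surrogate.

```python
def encode_utf16(code_point):
    """Return the UTF-16 encoding of a code point as a list of 16-bit units."""
    # Assumption: surrogates and values above 0x10FFFF are rejected.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("invalid unicode character code: %X" % code_point)
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: the code point is its own code unit.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    lead = 0xD800 + (offset >> 10)      # high ten bits
    trail = 0xDC00 + (offset & 0x3FF)   # low ten bits
    return [lead, trail]
```

For example, encode_utf16(0x1F600) returns [0xD83D, 0xDE00].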

...

> Index: tests/hard_coded/Mmakefile
> ===================================================================
> RCS file: /home/mercury1/repository/tests/hard_coded/Mmakefile,v
> retrieving revision 1.289
> diff -u -r1.289 Mmakefile
> --- tests/hard_coded/Mmakefile	19 Jul 2006 15:19:07 -0000	1.289
> +++ tests/hard_coded/Mmakefile	20 Jul 2006 07:58:41 -0000
> @@ -214,6 +214,7 @@
> 	type_spec_ho_term \
> 	type_spec_modes \
> 	type_to_term_bug \
> +	unicode \

That's a poor choice of module name for that test case if you are
planning to add a unicode module to the standard library.
(Unless you're also planning to put the stdlib in the std namespace as
well ;-) )

Julien.