[m-rev.] for review: add some unicode support to Mercury
Julien Fischer
juliensf at csse.unimelb.edu.au
Fri Jul 21 16:34:20 AEST 2006
On Fri, 21 Jul 2006, Ian MacLarty wrote:
> Estimated hours taken: 6
> Branches: main
>
> Add escape sequences for encoding unicode characters in Mercury string
> literals. The unicode characters are encoded in UTF-8 on the C backends and
> in UTF-16 on the Java and IL backends.
>
> NEWS:
> Mention the changes.
>
> doc/reference_manual.texi:
> Document the new escape sequences.
>
> Divide the section on string tokens up into multiple paragraphs.
>
> library/lexer.m:
> Convert unicode characters encoded with the new escape sequences to
> UTF-8 on the C backends and UTF-16 on the Java and IL backends.
>
> Some of the new predicate may be moved to a "unicode" module that I'm
s/predicate/predicates/
> intending to add to the standard library in a subsequent change.
>
> library/list.m:
> Fix the formatting of some comments.
>
> tests/hard_coded/Mmakefile:
> tests/hard_coded/unicode.exp:
> tests/hard_coded/unicode.m:
> tests/invalid/Mmakefile:
> tests/invalid/unicode1.exp:
> tests/invalid/unicode1.m:
> tests/invalid/unicode2.exp:
> tests/invalid/unicode2.m:
> Test the new support for unicode.
>
> Index: NEWS
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/NEWS,v
> retrieving revision 1.415
> diff -u -r1.415 NEWS
> --- NEWS 20 Jul 2006 01:50:41 -0000 1.415
> +++ NEWS 20 Jul 2006 08:21:32 -0000
> @@ -32,6 +32,8 @@
> * ':' is now the type qualification operator, not a module qualifier.
> * To ensure soundness, goals in negated contexts using non-local variables
> with dynamic modes (inst "any") must now be marked as impure.
> +* Unicode characters can now be encoded in string literals using an
> + escape sequence.
>
> Changes to the Mercury standard library:
> * We have removed the predicates dealing with runtime type information (RTTI)
> @@ -292,6 +294,13 @@
> sure that the goal is in fact pure, e.g. because they know that
> the goal inside the negation will not instantiate the variable.
>
> +* Unicode characters can now be encoded in string literals using an
> + escape sequence. The escape sequence \uXXXX (or \UXXXXXXXX), where XXXX
> + (or XXXXXXXX) is a unicode character code in hexadecimal, is replaced with
> + the corresponding unicode character. The encoding used to represent the
> + unicode character is implementation dependent. For the Melbourne Mercury
> + compiler unicode characters are represented using UTF-8 for the C backends.
> +
> Changes to the Mercury standard library:
>
> * We have added the function `divide_equivalence_classes' to the `eqvclass'
> Index: doc/reference_manual.texi
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/doc/reference_manual.texi,v
> retrieving revision 1.361
> diff -u -r1.361 reference_manual.texi
> --- doc/reference_manual.texi 20 Jul 2006 02:44:04 -0000 1.361
> +++ doc/reference_manual.texi 20 Jul 2006 07:58:33 -0000
> @@ -220,21 +220,35 @@
>
> @item string
> A string is a sequence of characters enclosed in double quotes (@code{"}).
> +
> Within a string, two adjacent double quotes stand for a single double quote.
> For example, the string @samp{ """" } is a string of length one, containing
> a single double quote: the outermost pair of double quotes encloses the
> string, and the innermost pair stand for a single double quote.
> +
> Strings may also contain backslash escapes. @samp{\a} stands for ``alert''
> (a beep character), @samp{\b} for backspace, @samp{\r} for carriage-return,
> @samp{\f} for form-feed, @samp{\t} for tab, @samp{\n} for newline,
> @samp{\v} for vertical-tab. An escaped backslash, single-quote, or
> -double-quote stands for itself. The sequence @samp{\x} introduces
> +double-quote stands for itself.
> +
> +The sequence @samp{\x} introduces
> a hexadecimal escape; it must be followed by a sequence of hexadecimal
> digits and then a closing backslash. It is replaced
> with the character whose character code is identified by the hexadecimal
> number. Similarly, a backslash followed by an octal digit is the
> beginning of an octal escape; as with hexadecimal escapes, the sequence
> of octal digits must be terminated with a closing backslash.
> +
> +The sequence @samp{\u} or @samp{\U} can be used to escape unicode characters.
> +@samp{\u} must be followed by the unicode character code expressed as four
> +hexadecimal digits.
> +@samp{\U} must be followed by the unicode character code expressed as eight
> +hexadecimal digits.
> +The encoding used for unicode characters is implementation dependent.
> +For the Melbourne Mercury compiler it is UTF-8 for the C backends and UTF-16
> +for the Java and IL backends.
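To make the escape rules above concrete, here is a minimal sketch in Python
of how a lexer might read the digits after \u or \U. It is only an
illustration: the function name, its return convention and the error
handling are hypothetical and are not taken from the patch.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def parse_unicode_escape(text, pos):
    # Parse a \uXXXX or \UXXXXXXXX escape that starts at text[pos].
    # Returns (code_point, position_after_escape).
    # Hypothetical helper: raises ValueError on anything malformed.
    if text.startswith("\\u", pos):
        ndigits = 4          # \u takes exactly four hex digits
    elif text.startswith("\\U", pos):
        ndigits = 8          # \U takes exactly eight hex digits
    else:
        raise ValueError("not a unicode escape")
    start = pos + 2
    digits = text[start:start + ndigits]
    if len(digits) != ndigits or any(c not in HEX_DIGITS for c in digits):
        raise ValueError("expected %d hexadecimal digits" % ndigits)
    return int(digits, 16), start + ndigits


if __name__ == "__main__":
    code_point, end = parse_unicode_escape("\\u20AC", 0)
    print(hex(code_point), end)
```

Unlike the \x and octal escapes, no closing backslash is needed here,
because the number of digits is fixed.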
...
> Index: library/lexer.m
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/library/lexer.m,v
> retrieving revision 1.46
> diff -u -r1.46 lexer.m
> --- library/lexer.m 19 Apr 2006 05:17:53 -0000 1.46
> +++ library/lexer.m 20 Jul 2006 07:13:33 -0000
> +:- pred encode_unicode_char_as_utf8(int::in, list(char)::out) is semidet.
> +
> +encode_unicode_char_as_utf8(UnicodeCharCode, UTF8Chars) :-
> + allowed_unicode_char_code(UnicodeCharCode),
> + ( if UnicodeCharCode =< 0x00007F then
> + UTF8Chars = [char.det_from_int(UnicodeCharCode)]
> + else if UnicodeCharCode =< 0x0007FF then
> + Part1 = (0b11111000000 /\ UnicodeCharCode) >> 6,
> + Part2 = 0b00000111111 /\ UnicodeCharCode,
> + char.det_from_int(Part1 \/ 0b11000000, UTF8Char1),
> + char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> + UTF8Chars = [UTF8Char1, UTF8Char2]
> + else if UnicodeCharCode =< 0x00FFFF then
> + Part1 = (0b1111000000000000 /\ UnicodeCharCode) >> 12,
> + Part2 = (0b0000111111000000 /\ UnicodeCharCode) >> 6,
> + Part3 = 0b0000000000111111 /\ UnicodeCharCode,
> + char.det_from_int(Part1 \/ 0b11100000, UTF8Char1),
> + char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> + char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
> + UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3]
> + else
> + Part1 = (0b111000000000000000000 /\ UnicodeCharCode) >> 18,
> + Part2 = (0b000111111000000000000 /\ UnicodeCharCode) >> 12,
> + Part3 = (0b000000000111111000000 /\ UnicodeCharCode) >> 6,
> + Part4 = 0b000000000000000111111 /\ UnicodeCharCode,
> + char.det_from_int(Part1 \/ 0b11110000, UTF8Char1),
> + char.det_from_int(Part2 \/ 0b10000000, UTF8Char2),
> + char.det_from_int(Part3 \/ 0b10000000, UTF8Char3),
> + char.det_from_int(Part4 \/ 0b10000000, UTF8Char4),
> + UTF8Chars = [UTF8Char1, UTF8Char2, UTF8Char3, UTF8Char4]
> + ).
Please document the algorithm used by the above code.
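A comment describing the bit layout would be enough. As a rough guide, the
standard UTF-8 scheme can be sketched as follows. This is Python rather than
Mercury and is only an illustration: the function name is made up, and the
validity check is my assumption about what allowed_unicode_char_code does
(its definition is not shown above).

```python
def encode_utf8(code_point):
    # Return the UTF-8 encoding of one code point as a list of byte values.
    #
    #   code point range     bits   byte layout
    #   0x0000   - 0x007F      7    0xxxxxxx
    #   0x0080   - 0x07FF     11    110xxxxx 10xxxxxx
    #   0x0800   - 0xFFFF     16    1110xxxx 10xxxxxx 10xxxxxx
    #   0x10000  - 0x10FFFF   21    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    #
    # Assumed validity check: in range and not a surrogate.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % code_point)
    if code_point <= 0x7F:
        return [code_point]
    if code_point <= 0x7FF:
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point <= 0xFFFF:
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    return [0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]


if __name__ == "__main__":
    print([hex(b) for b in encode_utf8(0x20AC)])
```

The leading byte carries the high-order bits plus a length marker; each
continuation byte carries six bits behind a 10 prefix.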
> +:- pred encode_unicode_char_as_utf16(int::in, list(char)::out) is semidet.
> +
> + % This predicate should only be called on backends that have
> + % a 16 bit character type.
> + %
> +encode_unicode_char_as_utf16(UnicodeCharCode, UTF16Chars) :-
> + allowed_unicode_char_code(UnicodeCharCode),
> + ( if UnicodeCharCode =< 0xFFFF then
> + char.det_from_int(UnicodeCharCode, Char),
> + UTF16Chars = [Char]
> + else
> + LeadOffset = 0xD800 - (0x10000 >> 10),
> + Lead = LeadOffset + (UnicodeCharCode >> 10),
> + Trail = 0xDC00 + (UnicodeCharCode /\ 0x3FF),
> + char.det_from_int(Lead, LeadChar),
> + char.det_from_int(Trail, TrailChar),
> + UTF16Chars = [LeadChar, TrailChar]
> + ).
> +
And this one.
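For the comment, the standard surrogate-pair construction is what wants
describing: lead surrogates occupy 0xD800-0xDBFF and trail surrogates
0xDC00-0xDFFF. A small Python sketch of that construction, with a made-up
function name and an assumed validity check in place of
allowed_unicode_char_code:

```python
def encode_utf16(code_point):
    # Return the UTF-16 encoding of one code point as a list of 16-bit units.
    # Assumed validity check: in range and not itself a surrogate.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % code_point)
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: one code unit, value unchanged.
        return [code_point]
    # Supplementary planes: subtracting 0x10000 leaves a 20-bit value.
    # The top ten bits go into the lead surrogate and the bottom ten into
    # the trail surrogate. Folding the subtraction into the offset gives
    # 0xD800 - (0x10000 >> 10) = 0xD7C0.
    lead_offset = 0xD800 - (0x10000 >> 10)
    lead = lead_offset + (code_point >> 10)
    trail = 0xDC00 + (code_point & 0x3FF)
    return [lead, trail]


if __name__ == "__main__":
    print([hex(u) for u in encode_utf16(0x1F600)])
```

The low ten bits are unaffected by subtracting 0x10000, which is why the
trail surrogate can be taken directly from the original code point.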
...
> Index: tests/hard_coded/Mmakefile
> ===================================================================
> RCS file: /home/mercury1/repository/tests/hard_coded/Mmakefile,v
> retrieving revision 1.289
> diff -u -r1.289 Mmakefile
> --- tests/hard_coded/Mmakefile 19 Jul 2006 15:19:07 -0000 1.289
> +++ tests/hard_coded/Mmakefile 20 Jul 2006 07:58:41 -0000
> @@ -214,6 +214,7 @@
> type_spec_ho_term \
> type_spec_modes \
> type_to_term_bug \
> + unicode \
That's a poor choice of module name for that test case if you are
planning to add a unicode module to the standard library.
(Unless you're also planning to put the stdlib in the std namespace ;-) )
Julien.