[m-rev.] for review: make characters an instance of the uenum typeclass
Peter Wang
novalazy at gmail.com
Tue Dec 20 18:54:10 AEDT 2022
On Tue, 20 Dec 2022 18:00:32 +1100 Julien Fischer <jfischer at opturion.com> wrote:
>
> For review by anyone.
>
> ---------------------
>
> Make characters an instance of the uenum typeclass
>
> The recent change to sparse_bitsets broke the lex library in extras.
> Specifically, we now need to make characters an instance of the
> uenum typeclass. This diff does so.
>
> library/char.m:
> Add predicates and functions for converting between unsigned integers
> and characters.
>
> Make characters an instance of the uenum typeclass.
>
> tests/hard_coded/Mmakefile:
> tests/hard_coded/char_uint_conv.{m,exp,exp2}:
> Add a test of the above conversions.
>
> NEWS:
> Announce the additions.
>
> extras/lex.m:
> Conform to recent changes.
>
> Julien.
>
> diff --git a/NEWS b/NEWS
> index a5322b4..913abd3 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -102,6 +102,18 @@ Changes to the Mercury standard library
> - func `promise_only_solution/1`
> - pred `promise_only_solution_io/4`
>
> +### Changes to the `char` module
> +
> +* The following type has had its typeclass memberships changed:
> +
> + - The type `character` is now an instance of the new `uenum` typeclass.
> +
Just state this directly, without the lead-up line?
> +* The following predicate and functions have been added:
> +
> + - func `to_uint/1`
> + - pred `from_uint/2`
> + - func `det_from_uint/1`
> +
> ### Changes to the `cord` module
>
> * The following predicates have been added:
> diff --git a/extras/lex/lex.m b/extras/lex/lex.m
> index 1f934c2..5e8722b 100644
> --- a/extras/lex/lex.m
> +++ b/extras/lex/lex.m
> @@ -98,7 +98,7 @@
> :- instance regexp(regexp).
> :- instance regexp(char).
> :- instance regexp(string).
> -:- instance regexp(sparse_bitset(T)) <= (regexp(T),enum(T)).
> +:- instance regexp(sparse_bitset(T)) <= (regexp(T),uenum(T)).
>
> % Some basic non-primitive regexps.
> %
> @@ -720,10 +720,10 @@ read_from_string(Offset, Result, String, unsafe_promise_unique(String)) :-
> )
> ].
>
> -:- instance regexp(sparse_bitset(T)) <= (regexp(T),enum(T)) where [
> +:- instance regexp(sparse_bitset(T)) <= (regexp(T),uenum(T)) where [
> re(SparseBitset) = charset(Charset) :-
> Charset = sparse_bitset.foldl(
> - func(Enum, Set0) = insert(Set0, char.det_from_int(to_int(Enum))),
> + func(Enum, Set0) = insert(Set0, char.det_from_uint(to_uint(Enum))),
> SparseBitset,
> sparse_bitset.init)
> ].
(BTW, sparse_bitset is an inefficient representation for large charsets
like valid_unicode_chars. A diet (discrete interval encoding tree) should
be better.)
> +:- pragma foreign_proc("C",
> + to_uint(Character::in) = (UInt::out),
> + [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail,
> + does_not_affect_liveness],
> +"
> + UInt = (MR_UnsignedChar) Character;
> +").
> +
> +:- pragma foreign_proc("C#",
> + to_uint(Character::in) = (UInt::out),
> + [will_not_call_mercury, promise_pure, thread_safe],
> +"
> + UInt = (uint) Character;
> +").
> +
> +:- pragma foreign_proc("Java",
> + to_uint(Character::in) = (UInt::out),
> + [will_not_call_mercury, promise_pure, thread_safe],
> +"
> + UInt = Character;
> +").
> +
> +:- pragma foreign_proc("C",
> + from_uint(UInt::in, Character::out),
> + [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail,
> + does_not_affect_liveness],
> +"
> + Character = (MR_UnsignedChar) UInt;
> + SUCCESS_INDICATOR = (UInt <= 0x10ffff);
> +").
> +
> +:- pragma foreign_proc("C#",
> + from_uint(UInt::in, Character::out),
> + [will_not_call_mercury, promise_pure, thread_safe],
> +"
> + Character = (int) UInt;
> + SUCCESS_INDICATOR = (UInt <= 0x10ffff);
> +").
> +
> +:- pragma foreign_proc("Java",
> + from_uint(UInt::in, Character::out),
> + [will_not_call_mercury, promise_pure, thread_safe],
> +"
> + Character = UInt;
> + SUCCESS_INDICATOR = ((UInt & 0xffffffffL) <= (0x10ffff & 0xffffffffL));
> +").
Do we need the foreign procs, or can we cast to/from int?
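On the Java version of from_uint: Java has no unsigned int type, so the range check has to compare the bits as unsigned, which is what the masking with 0xffffffffL does. A minimal sketch of why a plain signed comparison would not be enough, assuming the uint arrives as a Java int (as the assignment `UInt = Character` suggests). The class and method names are hypothetical and not part of the Mercury library:

```java
// Hypothetical helper class for illustration only.
final class UintRangeCheck {
    private static final int MAX_CODE_POINT = 0x10ffff;

    // Masking form, as in the quoted foreign_proc: widen both sides to long
    // and keep only the low 32 bits, so the int is read as unsigned.
    static boolean inRangeMasked(int uintBits) {
        return (uintBits & 0xffffffffL) <= (MAX_CODE_POINT & 0xffffffffL);
    }

    // Equivalent form using Integer.compareUnsigned (Java 8 and later).
    static boolean inRangeCompareUnsigned(int uintBits) {
        return Integer.compareUnsigned(uintBits, MAX_CODE_POINT) <= 0;
    }

    // A plain signed comparison: wrongly accepts uints above 0x7fffffff,
    // because their bit patterns are negative when read as int.
    static boolean inRangeSignedNaive(int uintBits) {
        return uintBits <= MAX_CODE_POINT;
    }

    public static void main(String[] args) {
        int big = 0xffffffff; // uint 4294967295, stored as int -1
        System.out.println(inRangeMasked(big));          // false
        System.out.println(inRangeCompareUnsigned(big)); // false
        System.out.println(inRangeSignedNaive(big));     // true
    }
}
```

Whichever form is used, the check must reject every uint above 0x10ffff, including those whose bit pattern is negative as a Java int.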
> diff --git a/tests/hard_coded/char_uint_conv.m b/tests/hard_coded/char_uint_conv.m
> index e69de29..bc59026 100644
> --- a/tests/hard_coded/char_uint_conv.m
> +++ b/tests/hard_coded/char_uint_conv.m
...
> +:- func test_chars = list(char).
> +
> +test_chars = [
> + % C0 Controls and Basic Latin
> + % (First BMP block)
> + char.det_from_int(0x0), % NULL CHARACTER
> + char.det_from_int(0x1), % START OF HEADING
> + char.det_from_int(0x1f), % UNIT SEPARATOR
> + ' ',
> + '0',
> + 'A',
> + 'Z',
> + 'a',
> + 'z',
> + '~',
> + char.det_from_int(0x7f), % DELETE
> +
> + % C1 Controls and Latin-1 Supplement
> + char.det_from_int(0x80), % PADDING CHARACTER
> + char.det_from_int(0x9f), % APPLICATION PROGRAM COMMAND
> + char.det_from_int(0xa0), % NO-BREAK SPACE
> + char.det_from_int(0xbf), % INVERTED QUESTION MARK
> + char.det_from_int(0xc0), % LATIN CAPITAL LETTER A WITH GRAVE
> + char.det_from_int(0xff), % LATIN SMALL LETTER Y WITH DIAERESIS
> +
> + % Linear B Syllabary
> + % (First SMP block)
> + char.det_from_int(0x10000), % LINEAR B SYLLABLE B008 A
> + char.det_from_int(0x1005d), % LINEAR B SYMBOL B089
> +
> + % Symbols for Legacy Computing.
> + % (Last SMP block)
> + char.det_from_int(0x1fb00), % BLOCK SEXTANT-1
> + char.det_from_int(0x1fbf9), % SEGMENTED DIGIT NINE
> +
> + % CJK Unified Idenographs Extension B
Ideographs
> + % (First SIP block)
> + char.det_from_int(0x20000),
> + char.det_from_int(0x2a6df),
> +
> + % CJK Compatibility Ideographs Supplement
> + % (Last SIP block),
> + char.det_from_int(0x2f800),
> + char.det_from_int(0x2fa1f),
> +
> + % Last valid Unicode code point.
> + char.det_from_int(0x10ffff)].
That looks fine, otherwise.
Peter