[m-rev.] for review: make characters an instance of the uenum typeclass

Peter Wang novalazy at gmail.com
Tue Dec 20 18:54:10 AEDT 2022


On Tue, 20 Dec 2022 18:00:32 +1100 Julien Fischer <jfischer at opturion.com> wrote:
> 
> For review by anyone.
> 
> ---------------------
> 
> Make characters an instance of the uenum typeclass
> 
> The recent change to sparse_bitsets broke the lex library in extras.
> Specifically, we now need to make characters an instance of the
> uenum typeclass. This diff does so.
> 
> library/char.m:
>       Add predicates and functions for converting between unsigned integers
>       and characters.
> 
>       Make characters an instance of the uenum typeclass.
> 
> tests/hard_coded/Mmakefile:
> tests/hard_coded/char_uint_conv.{m,exp,exp2}:
>       Add a test of the above conversions.
> 
> NEWS:
>       Announce the additions.
> 
> extras/lex.m:
>       Conform to recent changes.
> 
> Julien.
> 
> diff --git a/NEWS b/NEWS
> index a5322b4..913abd3 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -102,6 +102,18 @@ Changes to the Mercury standard library
>       - func `promise_only_solution/1`
>       - pred `promise_only_solution_io/4`
> 
> +### Changes to the `char` module
> +
> +* The following type has had its typeclass memberships changed:
> +
> +    - The type `character` is now an instance of the new `uenum` typeclass.
> +

Just state this directly, without the lead-up line?

> +* The following predicate and functions have been added:
> +
> +   - func `to_uint/1`
> +   - pred `from_uint/2`
> +   - func `det_from_uint/1`
> +
>   ### Changes to the `cord` module
> 
>   * The following predicates have been added:
> diff --git a/extras/lex/lex.m b/extras/lex/lex.m
> index 1f934c2..5e8722b 100644
> --- a/extras/lex/lex.m
> +++ b/extras/lex/lex.m
> @@ -98,7 +98,7 @@
>   :- instance regexp(regexp).
>   :- instance regexp(char).
>   :- instance regexp(string).
> -:- instance regexp(sparse_bitset(T)) <= (regexp(T),enum(T)).
> +:- instance regexp(sparse_bitset(T)) <= (regexp(T),uenum(T)).
> 
>       % Some basic non-primitive regexps.
>       %
> @@ -720,10 +720,10 @@ read_from_string(Offset, Result, String, unsafe_promise_unique(String)) :-
>           )
>   ].
> 
> -:- instance regexp(sparse_bitset(T)) <= (regexp(T),enum(T)) where [
> +:- instance regexp(sparse_bitset(T)) <= (regexp(T),uenum(T)) where [
>       re(SparseBitset) = charset(Charset) :-
>           Charset = sparse_bitset.foldl(
> -            func(Enum, Set0) = insert(Set0, char.det_from_int(to_int(Enum))),
> +            func(Enum, Set0) = insert(Set0, char.det_from_uint(to_uint(Enum))),
>               SparseBitset,
>               sparse_bitset.init)
>   ].

(BTW, sparse_bitset is an inefficient representation for large charsets,
like valid_unicode_chars. diet should be better.)
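
To illustrate the point about large charsets: an interval-based set
stores each contiguous run of code points as a single range, whereas a
bitset pays roughly one bit per member. The following is only a C sketch
of the interval idea, using a sorted array with binary search rather than
the tree that diet actually uses. The names cp_range, ranges_contain and
example_charset are hypothetical, and the example charset (all code
points except surrogates) is an assumption, not necessarily what
valid_unicode_chars contains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One closed interval [lo, hi] of code points. */
struct cp_range {
    uint32_t lo;
    uint32_t hi;
};

/*
 * Hypothetical large charset: every code point except the surrogates.
 * Over a million members, yet only two intervals.
 */
const struct cp_range example_charset[] = {
    { 0x0000u, 0xd7ffu },
    { 0xe000u, 0x10ffffu }
};

/*
 * Membership test by binary search.
 * The ranges must be sorted and non-overlapping.
 */
bool ranges_contain(const struct cp_range *ranges, size_t n, uint32_t cp)
{
    size_t lo = 0;
    size_t hi = n;

    assert(ranges != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < ranges[mid].lo) {
            hi = mid;           /* cp lies before this range */
        } else if (cp > ranges[mid].hi) {
            lo = mid + 1;       /* cp lies after this range */
        } else {
            return true;        /* ranges[mid].lo <= cp <= ranges[mid].hi */
        }
    }
    return false;
}
```

With this layout, ranges_contain(example_charset, 2, 'A') is true and
ranges_contain(example_charset, 2, 0xd800) is false.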

> +:- pragma foreign_proc("C",
> +    to_uint(Character::in) = (UInt::out),
> +    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail,
> +        does_not_affect_liveness],
> +"
> +    UInt = (MR_UnsignedChar) Character;
> +").
> +
> +:- pragma foreign_proc("C#",
> +    to_uint(Character::in) = (UInt::out),
> +    [will_not_call_mercury, promise_pure, thread_safe],
> +"
> +    UInt = (uint) Character;
> +").
> +
> +:- pragma foreign_proc("Java",
> +    to_uint(Character::in) = (UInt::out),
> +    [will_not_call_mercury, promise_pure, thread_safe],
> +"
> +    UInt = Character;
> +").
> +
> +:- pragma foreign_proc("C",
> +    from_uint(UInt::in, Character::out),
> +    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail,
> +        does_not_affect_liveness],
> +"
> +    Character = (MR_UnsignedChar) UInt;
> +    SUCCESS_INDICATOR = (UInt <= 0x10ffff);
> +").
> +
> +:- pragma foreign_proc("C#",
> +    from_uint(UInt::in, Character::out),
> +    [will_not_call_mercury, promise_pure, thread_safe],
> +"
> +    Character = (int) UInt;
> +    SUCCESS_INDICATOR = (UInt <= 0x10ffff);
> +").
> +
> +:- pragma foreign_proc("Java",
> +    from_uint(UInt::in, Character::out),
> +    [will_not_call_mercury, promise_pure, thread_safe],
> +"
> +    Character = UInt;
> +    SUCCESS_INDICATOR = ((UInt & 0xffffffffL) <= (0x10ffff & 0xffffffffL));
> +").

Do we need the foreign procs, or can we cast to/from int?
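
To make the question concrete: the C foreign procs above amount to a cast
in one direction and a cast plus a range check in the other. A standalone
sketch of that logic follows. It assumes a signed 32-bit code point as a
stand-in for MR_Char, and the names code_point, cp_to_uint, cp_from_uint
and check_round_trip are made up for illustration. Unlike the quoted
code, it leaves the output untouched on failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for Mercury's character type: a signed code point. */
typedef int32_t code_point;

#define MAX_CODE_POINT 0x10ffffu

/* Counterpart of to_uint/1: a valid character is never negative,
   so the cast preserves its value. */
uint32_t cp_to_uint(code_point c)
{
    return (uint32_t) c;
}

/* Counterpart of from_uint/2: succeeds only if u is at most 0x10ffff. */
bool cp_from_uint(uint32_t u, code_point *out)
{
    if (u > MAX_CODE_POINT) {
        return false;
    }
    *out = (code_point) u;
    return true;
}

/* Round-trip a few of the code points used in the test case. */
void check_round_trip(void)
{
    static const uint32_t samples[] = {
        0x0u, 0x7fu, 0xffu, 0x10000u, 0x2fa1fu, 0x10ffffu
    };
    size_t i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        code_point c;
        assert(cp_from_uint(samples[i], &c));
        assert(cp_to_uint(c) == samples[i]);
    }
}
```

If casting between int and uint at the Mercury level gives the same
results, the same two steps (range check, then cast) could presumably be
written in plain Mercury, with no foreign code.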

> diff --git a/tests/hard_coded/char_uint_conv.m b/tests/hard_coded/char_uint_conv.m
> index e69de29..bc59026 100644
> --- a/tests/hard_coded/char_uint_conv.m
> +++ b/tests/hard_coded/char_uint_conv.m
...
> +:- func test_chars = list(char).
> +
> +test_chars = [
> +    % C0 Controls and Basic Latin
> +    % (First BMP block)
> +    char.det_from_int(0x0),   % NULL CHARACTER
> +    char.det_from_int(0x1),   % START OF HEADING
> +    char.det_from_int(0x1f),  % UNIT SEPARATOR
> +    ' ',
> +    '0',
> +    'A',
> +    'Z',
> +    'a',
> +    'z',
> +    '~',
> +    char.det_from_int(0x7f), % DELETE
> +
> +    % C1 Controls and Latin-1 Supplement
> +    char.det_from_int(0x80),   % PADDING CHARACTER
> +    char.det_from_int(0x9f),   % APPLICATION PROGRAM COMMAND
> +    char.det_from_int(0xa0),   % NO-BREAK SPACE
> +    char.det_from_int(0xbf),   % INVERTED QUESTION MARK
> +    char.det_from_int(0xc0),   % LATIN CAPITAL LETTER A WITH GRAVE
> +    char.det_from_int(0xff),   % LATIN SMALL LETTER Y WITH DIAERESIS
> +
> +    % Linear B Syllabary
> +    % (First SMP block)
> +    char.det_from_int(0x10000), % LINEAR B SYLLABLE B008 A
> +    char.det_from_int(0x1005d), % LINEAR B SYMBOL B089
> +
> +    % Symbols for Legacy Computing.
> +    % (Last SMP block)
> +    char.det_from_int(0x1fb00), % BLOCK SEXTANT-1
> +    char.det_from_int(0x1fbf9), % SEGMENTED DIGIT NINE
> +
> +    % CJK Unified Idenographs Extension B

Ideographs

> +    % (First SIP block)
> +    char.det_from_int(0x20000),
> +    char.det_from_int(0x2a6df),
> +
> +    % CJK Compatibility Ideographs Supplement
> +    % (Last SIP block),
> +    char.det_from_int(0x2f800),
> +    char.det_from_int(0x2fa1f),
> +
> +    % Last valid Unicode code point.
> +    char.det_from_int(0x10ffff)].

That looks fine, otherwise.

Peter

