[m-rev.] for review: encoding chars as uint8s or uint16s

Peter Wang novalazy at gmail.com
Tue Jul 20 18:08:30 AEST 2021


On Tue, 20 Jul 2021 14:58:56 +1000 Julien Fischer <jfischer at opturion.com> wrote:
> 
> For review by anyone.
> 
> There's one other thing I would like feedback on here.  All of these
> predicates currently fail when the input is either a surrogate (which
> is fine) or a character outside the valid Unicode code point range
> (0x0 .. 0x10ffff).  I think the latter case should cause an exception
> to be thrown.
> 

I don't exactly object to throwing an exception, but chars are always
supposed to be in the Unicode range. Of course, it's still possible to
introduce a bad char via the FFI.
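
For concreteness, the proposed behaviour (fail for a surrogate, throw for
an out-of-range value) might look roughly like the sketch below. The
module name and the is_encodable/1 predicate are hypothetical and not
part of the posted diff. It assumes unexpected/2 from the require module
is an acceptable way to raise the exception.

```mercury
:- module code_point_check.
:- interface.
:- import_module char.

    % Hypothetical helper, not part of the posted diff.
    % Fails for surrogates; throws for values outside 0x0 .. 0x10ffff.
    %
:- pred is_encodable(char::in) is semidet.

:- implementation.
:- import_module int.
:- import_module require.

is_encodable(Char) :-
    Int = char.to_int(Char),
    ( if ( Int < 0 ; Int > 0x10ffff ) then
        % Only reachable if a bad char was introduced, e.g. via the FFI.
        unexpected($pred, "code point outside Unicode range")
    else
        not char.is_surrogate(Char)
    ).
```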

> --------------------------
> 
> Encoding chars as uint8s or uint16s.
> 
> library/char.m:
>      Add to_utf8_uint8/2 and to_utf16_uint16/2. These do the
>      same thing as the existing to_utf8/2 and to_utf16/2 predicates,
>      but return the code units as uint8s or uint16s respectively.
> 
>      Update the comment at the head of the module about which version
>      of the Unicode standard this module conforms to.
>      (For the purposes of this module, nothing has changed between
>      versions 10 and 13.)
> 
> NEWS:
>      Announce the additions.
> 
> Julien.

> diff --git a/library/char.m b/library/char.m
> index bce9809..1d84da5 100644
...
> @@ -975,6 +985,32 @@ to_utf8(Char, CodeUnits) :-
>           fail
>       ).
> 
> +to_utf8_uint8(Char, CodeUnits) :-
> +    Int = char.to_int(Char),
> +    ( if Int =< 0x7f then
> +        A = uint8.cast_from_int(Int),
> +        CodeUnits = [A]
> +    else if Int =< 0x7ff then
> +        A = uint8.cast_from_int(0xc0 \/ ((Int >> 6) /\ 0x1f)),
> +        B = uint8.cast_from_int(0x80 \/  (Int       /\ 0x3f)),
> +        CodeUnits = [A, B]
> +    else if Int =< 0xffff then
> +        not is_surrogate(Char),
> +        A = uint8.cast_from_int(0xe0 \/ ((Int >> 12) /\ 0x0f)),
> +        B = uint8.cast_from_int(0x80 \/ ((Int >>  6) /\ 0x3f)),
> +        C = uint8.cast_from_int(0x80 \/  (Int        /\ 0x3f)),
> +        CodeUnits = [A, B, C]
> +    else if Int =< 0x10ffff then
> +        A = uint8.cast_from_int(0xf0 \/ ((Int >> 18) /\ 0x07)),
> +        B = uint8.cast_from_int(0x80 \/ ((Int >> 12) /\ 0x3f)),
> +        C = uint8.cast_from_int(0x80 \/ ((Int >>  6) /\ 0x3f)),
> +        D = uint8.cast_from_int(0x80 \/  (Int        /\ 0x3f)),
> +        CodeUnits = [A, B, C, D]
> +    else
> +        % Illegal code point.
> +        fail
> +    ).
> +

Perhaps use a shared predicate to avoid the code duplication:

    to_utf8(Char, CodeUnits) :-
        to_utf8_code_units(Char, Length, B0, B1, B2, B3),
        (
            Length = 1,
            CodeUnits = [uint8.to_int(B0)]
        ;
            Length = 2,
            CodeUnits = [uint8.to_int(B1), uint8.to_int(B0)]
        ;
            Length = 3,
            CodeUnits = [uint8.to_int(B2), uint8.to_int(B1), uint8.to_int(B0)]
        ;
            Length = 4,
            CodeUnits = [uint8.to_int(B3), uint8.to_int(B2),
                         uint8.to_int(B1), uint8.to_int(B0)]
        ).
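
The helper to_utf8_code_units/6 is not spelled out above. A minimal
sketch of one possible definition follows, together with the uint8
variant built on top of it. The argument convention is an assumption
inferred from the to_utf8 example: B0 is the last code unit of the
sequence, and any units beyond Length are set to zero. The module name
is made up for illustration.

```mercury
:- module utf8_units.
:- interface.
:- import_module char.
:- import_module list.

:- pred to_utf8_uint8(char::in, list(uint8)::out) is semidet.

:- implementation.
:- import_module int.
:- import_module uint8.

    % Assumed convention: B0 is the final code unit, B(Length - 1) is
    % the lead code unit, and unused outputs are zero.
    %
:- pred to_utf8_code_units(char::in, int::out,
    uint8::out, uint8::out, uint8::out, uint8::out) is semidet.

to_utf8_code_units(Char, Length, B0, B1, B2, B3) :-
    Int = char.to_int(Char),
    ( if Int =< 0x7f then
        Length = 1,
        B0 = uint8.cast_from_int(Int),
        B1 = 0u8, B2 = 0u8, B3 = 0u8
    else if Int =< 0x7ff then
        Length = 2,
        B1 = uint8.cast_from_int(0xc0 \/ ((Int >> 6) /\ 0x1f)),
        B0 = uint8.cast_from_int(0x80 \/  (Int       /\ 0x3f)),
        B2 = 0u8, B3 = 0u8
    else if Int =< 0xffff then
        not char.is_surrogate(Char),
        Length = 3,
        B2 = uint8.cast_from_int(0xe0 \/ ((Int >> 12) /\ 0x0f)),
        B1 = uint8.cast_from_int(0x80 \/ ((Int >>  6) /\ 0x3f)),
        B0 = uint8.cast_from_int(0x80 \/  (Int        /\ 0x3f)),
        B3 = 0u8
    else if Int =< 0x10ffff then
        Length = 4,
        B3 = uint8.cast_from_int(0xf0 \/ ((Int >> 18) /\ 0x07)),
        B2 = uint8.cast_from_int(0x80 \/ ((Int >> 12) /\ 0x3f)),
        B1 = uint8.cast_from_int(0x80 \/ ((Int >>  6) /\ 0x3f)),
        B0 = uint8.cast_from_int(0x80 \/  (Int        /\ 0x3f))
    else
        % Illegal code point.
        fail
    ).

to_utf8_uint8(Char, CodeUnits) :-
    to_utf8_code_units(Char, Length, B0, B1, B2, B3),
    (
        Length = 1,
        CodeUnits = [B0]
    ;
        Length = 2,
        CodeUnits = [B1, B0]
    ;
        Length = 3,
        CodeUnits = [B2, B1, B0]
    ;
        Length = 4,
        CodeUnits = [B3, B2, B1, B0]
    ).
```

With this arrangement the bit manipulation lives in one place, and the
int and uint8 variants differ only in how they build the list.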

That's fine, otherwise.

Peter

