[m-rev.] for review: encoding chars as uint8s or uint16s
Peter Wang
novalazy at gmail.com
Tue Jul 20 18:08:30 AEST 2021
On Tue, 20 Jul 2021 14:58:56 +1000 Julien Fischer <jfischer at opturion.com> wrote:
>
> For review by anyone.
>
> There's one other thing I would like feedback on here. All of these
> predicates currently fail when the input is either a surrogate (which
> is fine) or if the character is outside the valid Unicode code point
> range (0x0 .. 0x10ffff). I think the latter case should cause an
> exception to be thrown.
>
I don't exactly object to throwing an exception, but chars are always
supposed to be in the Unicode range. Obviously, it's possible to
introduce a bad char via the FFI.
> --------------------------
>
> Encoding chars as uint8s or uint16s.
>
> library/char.m:
>     Add to_utf8_uint8/2 and to_utf16_uint16/2. These do the
>     same thing as the existing to_utf8/2 and to_utf16/2 predicates,
>     but return the code units as uint8s or uint16s respectively.
>
>     Update the comment at the head of the module about which version
>     of the Unicode standard this module conforms to.
>     (For the purposes of this module, nothing has changed between
>     versions 10 and 13.)
>
> NEWS:
>     Announce the additions.
>
> Julien.
> diff --git a/library/char.m b/library/char.m
> index bce9809..1d84da5 100644
...
> @@ -975,6 +985,32 @@ to_utf8(Char, CodeUnits) :-
>          fail
>      ).
>
> +to_utf8_uint8(Char, CodeUnits) :-
> +    Int = char.to_int(Char),
> +    ( if Int =< 0x7f then
> +        A = uint8.cast_from_int(Int),
> +        CodeUnits = [A]
> +    else if Int =< 0x7ff then
> +        A = uint8.cast_from_int(0xc0 \/ ((Int >> 6) /\ 0x1f)),
> +        B = uint8.cast_from_int(0x80 \/ (Int /\ 0x3f)),
> +        CodeUnits = [A, B]
> +    else if Int =< 0xffff then
> +        not is_surrogate(Char),
> +        A = uint8.cast_from_int(0xe0 \/ ((Int >> 12) /\ 0x0f)),
> +        B = uint8.cast_from_int(0x80 \/ ((Int >> 6) /\ 0x3f)),
> +        C = uint8.cast_from_int(0x80 \/ (Int /\ 0x3f)),
> +        CodeUnits = [A, B, C]
> +    else if Int =< 0x10ffff then
> +        A = uint8.cast_from_int(0xf0 \/ ((Int >> 18) /\ 0x07)),
> +        B = uint8.cast_from_int(0x80 \/ ((Int >> 12) /\ 0x3f)),
> +        C = uint8.cast_from_int(0x80 \/ ((Int >> 6) /\ 0x3f)),
> +        D = uint8.cast_from_int(0x80 \/ (Int /\ 0x3f)),
> +        CodeUnits = [A, B, C, D]
> +    else
> +        % Illegal code point.
> +        fail
> +    ).
> +
> +
Perhaps use a shared predicate to avoid the code duplication:

    to_utf8(Char, CodeUnits) :-
        to_utf8_code_units(Char, Length, B0, B1, B2, B3),
        (
            Length = 1,
            CodeUnits = [uint8.to_int(B0)]
        ;
            Length = 2,
            CodeUnits = [uint8.to_int(B1), uint8.to_int(B0)]
        ;
            Length = 3,
            CodeUnits = [uint8.to_int(B2), uint8.to_int(B1), uint8.to_int(B0)]
        ;
            Length = 4,
            CodeUnits = [uint8.to_int(B3), uint8.to_int(B2),
                uint8.to_int(B1), uint8.to_int(B0)]
        ).

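
For illustration only, here is a rough sketch of how such a shared predicate might look as a standalone module. It is not part of the posted diff. The module name utf8_sketch and the predicate names utf8_code_units and utf8_uint8s are hypothetical, chosen to avoid clashing with the names in library/char.m. It assumes the argument convention implied above: B0 is the last code unit of the sequence, and unused slots are filled with 0u8.

```mercury
:- module utf8_sketch.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module char.
:- import_module int.
:- import_module list.
:- import_module uint8.

    % Hypothetical shared helper. Length is the number of code units;
    % B0 is the LAST code unit of the sequence; unused slots are 0u8.
    % Fails for surrogates and for values above 0x10ffff, matching the
    % behaviour of the predicates in the diff.
    %
:- pred utf8_code_units(char::in, int::out,
    uint8::out, uint8::out, uint8::out, uint8::out) is semidet.

utf8_code_units(Char, Length, B0, B1, B2, B3) :-
    Int = char.to_int(Char),
    ( if Int =< 0x7f then
        Length = 1,
        B0 = uint8.cast_from_int(Int),
        B1 = 0u8, B2 = 0u8, B3 = 0u8
    else if Int =< 0x7ff then
        Length = 2,
        B1 = uint8.cast_from_int(0xc0 \/ ((Int >> 6) /\ 0x1f)),
        B0 = uint8.cast_from_int(0x80 \/ (Int /\ 0x3f)),
        B2 = 0u8, B3 = 0u8
    else if Int =< 0xffff then
        not char.is_surrogate(Char),
        Length = 3,
        B2 = uint8.cast_from_int(0xe0 \/ ((Int >> 12) /\ 0x0f)),
        B1 = uint8.cast_from_int(0x80 \/ ((Int >> 6) /\ 0x3f)),
        B0 = uint8.cast_from_int(0x80 \/ (Int /\ 0x3f)),
        B3 = 0u8
    else if Int =< 0x10ffff then
        Length = 4,
        B3 = uint8.cast_from_int(0xf0 \/ ((Int >> 18) /\ 0x07)),
        B2 = uint8.cast_from_int(0x80 \/ ((Int >> 12) /\ 0x3f)),
        B1 = uint8.cast_from_int(0x80 \/ ((Int >> 6) /\ 0x3f)),
        B0 = uint8.cast_from_int(0x80 \/ (Int /\ 0x3f))
    else
        % Illegal code point.
        fail
    ).

    % The uint8 variant then only has to assemble the list;
    % an int variant would wrap each unit in uint8.to_int/1 instead.
    %
:- pred utf8_uint8s(char::in, list(uint8)::out) is semidet.

utf8_uint8s(Char, CodeUnits) :-
    utf8_code_units(Char, Length, B0, B1, B2, B3),
    (
        Length = 1,
        CodeUnits = [B0]
    ;
        Length = 2,
        CodeUnits = [B1, B0]
    ;
        Length = 3,
        CodeUnits = [B2, B1, B0]
    ;
        Length = 4,
        CodeUnits = [B3, B2, B1, B0]
    ).

main(!IO) :-
    % U+00E9 needs two code units in UTF-8.
    ( if utf8_uint8s(char.det_from_int(0xe9), CodeUnits) then
        io.write_line(CodeUnits, !IO)
    else
        io.write_string("not encodable\n", !IO)
    ).
```

With this shape the bit manipulation lives in one place, and each public predicate is reduced to a switch on Length.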
That's fine, otherwise.
Peter