[m-rev.] for review: encoding chars as uint8s or uint16s

Julien Fischer jfischer at opturion.com
Tue Jul 20 14:58:56 AEST 2021


For review by anyone.

There's one other thing I would like feedback on here.  All of these
predicates currently fail when the input is either a surrogate (which
is fine) or a character outside the valid Unicode code point range
(0x0 .. 0x10ffff).  I think the latter case should throw an exception
instead.
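
To make the suggested split concrete, here is a rough model of it, written
in Python rather than Mercury so it can be run standalone.  It is only a
sketch: `encode_utf16_units` is a hypothetical name, returning None stands
in for semidet failure, and ValueError stands in for a Mercury exception.
It is not part of the patch.

```python
from typing import List, Optional

MAX_CODE_POINT = 0x10FFFF  # upper end of the valid Unicode code point range


def encode_utf16_units(code_point: int) -> Optional[List[int]]:
    """Hypothetical model of the proposed behaviour, not the Mercury code."""
    if code_point < 0 or code_point > MAX_CODE_POINT:
        # Proposed change: out-of-range input is treated as a caller error.
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates have no UTF-16 encoding; this keeps reporting failure.
        return None
    if code_point <= 0xFFFF:
        # A single code unit covers the Basic Multilingual Plane.
        return [code_point]
    # Supplementary planes: split into a lead and a trail surrogate.
    u = code_point - 0x10000
    return [0xD800 | (u >> 10), 0xDC00 | (u & 0x3FF)]
```

Under this model a caller can still test for surrogates with an ordinary
failure check, while a value above 0x10ffff surfaces immediately instead of
being silently treated the same way.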

--------------------------

Encoding chars as uint8s or uint16s.

library/char.m:
     Add to_utf8_uint8/2 and to_utf16_uint16/2. These do the
     same thing as the existing to_utf8/2 and to_utf16/2 predicates,
     but return the code units as uint8s or uint16s respectively.

     Update the comment at the head of the module about which version
     of the Unicode standard this module conforms to.
     (For the purposes of this module, nothing has changed between
     versions 10 and 13.)

NEWS:
     Announce the additions.
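
For anyone wanting to sanity-check the UTF-8 code units independently of
the Mercury implementation, the following Python sketch mirrors the bit
arithmetic in the diff below and compares it against Python's built-in
encoder.  `utf8_code_units` is a hypothetical name, None stands in for
predicate failure, and the negative-input check is an assumption added for
the Python model only.

```python
from typing import List, Optional


def utf8_code_units(cp: int) -> Optional[List[int]]:
    """Model of the UTF-8 branch structure; None stands in for failure."""
    if cp < 0:
        return None  # assumption: not reachable from a Mercury char
    if cp <= 0x7F:
        return [cp]
    if cp <= 0x7FF:
        return [0xC0 | ((cp >> 6) & 0x1F),
                0x80 | (cp & 0x3F)]
    if cp <= 0xFFFF:
        if 0xD800 <= cp <= 0xDFFF:
            return None  # surrogate code point
        return [0xE0 | ((cp >> 12) & 0x0F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    if cp <= 0x10FFFF:
        return [0xF0 | ((cp >> 18) & 0x07),
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return None  # outside the Unicode code point range


# Cross-check one sample from each length class against the built-in encoder.
for sample in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert bytes(utf8_code_units(sample)) == chr(sample).encode("utf-8")
```

Every value produced this way lies in 0 .. 0xff, which is why casting to
uint8 loses nothing.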

Julien.

diff --git a/NEWS b/NEWS
index 70664ca..1a56c90 100644
--- a/NEWS
+++ b/NEWS
@@ -625,6 +625,8 @@ Changes to the Mercury standard library

      - func `hash/1`
      - pred `hash/2`
+    - pred `to_utf8_uint8/2`
+    - pred `to_utf16_uint16/2`

  ### Changes to the `float` module

diff --git a/library/char.m b/library/char.m
index bce9809..1d84da5 100644
--- a/library/char.m
+++ b/library/char.m
@@ -2,7 +2,7 @@
  % vim: ft=mercury ts=4 sw=4 et
  %---------------------------------------------------------------------------%
  % Copyright (C) 1994-2008, 2011 The University of Melbourne.
-% Copyright (C) 2013-2015, 2017-2018 The Mercury team.
+% Copyright (C) 2013-2015, 2017-2021 The Mercury team.
  % This file is distributed under the terms specified in COPYING.LIB.
  %---------------------------------------------------------------------------%
  %
@@ -17,7 +17,7 @@
  % But now we use `char' and the use of `character' is discouraged.
  %
  % All predicates and functions exported by this module that deal with
-% Unicode conform to version 10 of the Unicode standard.
+% Unicode conform to version 13 of the Unicode standard.
  %
  %---------------------------------------------------------------------------%
  %---------------------------------------------------------------------------%
@@ -286,11 +286,19 @@
      %
  :- pred to_utf8(char::in, list(int)::out) is semidet.

+    % As above, but represent UTF-8 code units using uint8s.
+    %
+:- pred to_utf8_uint8(char::in, list(uint8)::out) is semidet.
+
      % Encode a Unicode code point in UTF-16 (native endianness).
      % Fails for surrogate code points.
      %
  :- pred to_utf16(char::in, list(int)::out) is semidet.

+    % As above, but represent UTF-16 code units using uint16s.
+    %
+:- pred to_utf16_uint16(char::in, list(uint16)::out) is semidet.
+
      % True iff the character is a Unicode Surrogate code point, that is a code
      % point in General Category `Other,surrogate' (`Cs').
      % In UTF-16, a code point with a scalar value greater than 0xffff is
@@ -384,6 +392,8 @@
  :- import_module require.
  :- import_module term_io.
  :- import_module uint.
+:- import_module uint16.
+:- import_module uint8.

  :- instance enum(character) where [
      (to_int(X) = Y :-
@@ -975,6 +985,32 @@ to_utf8(Char, CodeUnits) :-
          fail
      ).

+to_utf8_uint8(Char, CodeUnits) :-
+    Int = char.to_int(Char),
+    ( if Int =< 0x7f then
+        A = uint8.cast_from_int(Int),
+        CodeUnits = [A]
+    else if Int =< 0x7ff then
+        A = uint8.cast_from_int(0xc0 \/ ((Int >> 6) /\ 0x1f)),
+        B = uint8.cast_from_int(0x80 \/  (Int       /\ 0x3f)),
+        CodeUnits = [A, B]
+    else if Int =< 0xffff then
+        not is_surrogate(Char),
+        A = uint8.cast_from_int(0xe0 \/ ((Int >> 12) /\ 0x0f)),
+        B = uint8.cast_from_int(0x80 \/ ((Int >>  6) /\ 0x3f)),
+        C = uint8.cast_from_int(0x80 \/  (Int        /\ 0x3f)),
+        CodeUnits = [A, B, C]
+    else if Int =< 0x10ffff then
+        A = uint8.cast_from_int(0xf0 \/ ((Int >> 18) /\ 0x07)),
+        B = uint8.cast_from_int(0x80 \/ ((Int >> 12) /\ 0x3f)),
+        C = uint8.cast_from_int(0x80 \/ ((Int >>  6) /\ 0x3f)),
+        D = uint8.cast_from_int(0x80 \/  (Int        /\ 0x3f)),
+        CodeUnits = [A, B, C, D]
+    else
+        % Illegal code point.
+        fail
+    ).
+
  to_utf16(Char, CodeUnits) :-
      Int = char.to_int(Char),
      ( if Int < 0xd800 then
@@ -995,6 +1031,28 @@ to_utf16(Char, CodeUnits) :-
          fail
      ).

+to_utf16_uint16(Char, CodeUnits) :-
+    Int = char.to_int(Char),
+    ( if Int < 0xd800 then
+        % Common case.
+        A = uint16.cast_from_int(Int),
+        CodeUnits = [A]
+    else if Int =< 0xdfff then
+        % Surrogate.
+        fail
+    else if Int =< 0xffff then
+        A = uint16.cast_from_int(Int),
+        CodeUnits = [A]
+    else if Int =< 0x10ffff then
+        U = Int - 0x10000,
+        A = uint16.cast_from_int(0xd800 \/ (U >> 10)),
+        B = uint16.cast_from_int(0xdc00 \/ (U /\ 0x3ff)),
+        CodeUnits = [A, B]
+    else
+        % Illegal code point.
+        fail
+    ).
+
  is_surrogate(Char) :-
      Int = char.to_int(Char),
      Int >= 0xd800,

