[m-rev.] for review: add more character classification predicates

Julien Fischer jfischer at opturion.com
Mon Jul 24 21:51:20 AEST 2017


For review by anyone.

I've found myself needing these often enough recently, while cleaning
inputs etc., that they may as well be in the standard library.
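
A minimal sketch of that kind of input cleaning, assuming the predicates
land in the char module as proposed below. The module name clean_input and
the helpers is_unwanted/1 and strip_unwanted/1 are hypothetical names used
only for illustration; they are not part of this change.

```mercury
:- module clean_input.
:- interface.

:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.

:- import_module char.
:- import_module list.
:- import_module string.

    % Hypothetical helper: true iff the character should be dropped
    % from the input, i.e. it is a control or private-use code point.
    %
:- pred is_unwanted(char::in) is semidet.

is_unwanted(Char) :-
    ( char.is_control(Char)
    ; char.is_private_use(Char)
    ).

    % Hypothetical helper: return the string with all unwanted
    % characters removed.
    %
:- func strip_unwanted(string) = string.

strip_unwanted(Str) =
    string.from_char_list(
        list.negated_filter(is_unwanted, string.to_char_list(Str))).

main(!IO) :-
    % The embedded tab is a control character, so it is removed.
    io.write_string(strip_unwanted("tab\there"), !IO),
    io.nl(!IO).
```

Whether control characters should be dropped or replaced (with a space, say)
depends on the application; the sketch simply drops them.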

---------------------

Add more character classification predicates.

library/char.m:
      Add the predicates is_control/1, is_space_separator/1,
      is_paragraph_separator/1, is_line_separator/1 and is_private_use/1.

      Reword some documentation from "Succeed if ..." to "True iff ...".
      This is more consistent with the other documentation in this module.

NEWS:
      Announce the additions.

Julien.

diff --git a/NEWS b/NEWS
index 568dea54a..940aa3c78 100644
--- a/NEWS
+++ b/NEWS
@@ -278,6 +278,8 @@ Changes to the Mercury standard library:
      - hex_digit_to_int/2, det_hex_digit_to_int/1
      - base_digit_to_int/3, det_base_digit_to_int/2
      - is_leading_surrogate/1, is_trailing_surrogate/1
+    - is_control/1, is_space_separator/1, is_paragraph_separator/1
+    - is_line_separator/1, is_private_use/1

    The following predicates in the char module have been deprecated and will
    either be removed or have their semantics changed in a future release:
diff --git a/library/char.m b/library/char.m
index 3431bf5e6..b200b7f17 100644
--- a/library/char.m
+++ b/library/char.m
@@ -276,31 +276,57 @@
      %
  :- pred to_utf16(char::in, list(int)::out) is semidet.

-    % Succeed if `Char' is a Unicode surrogate code point.
-    % In UTF-16, a code point with a scalar value greater than 0xffff
-    % is encoded with a pair of surrogate code points.
+    % True iff the character is a Unicode Surrogate code point, that is a code
+    % point in General Category `Other,surrogate' (`Cs').
+    % In UTF-16, a code point with a scalar value greater than 0xffff is
+    % encoded with a pair of surrogate code points.
      %
  :- pred is_surrogate(char::in) is semidet.

-    % Succeed if `Char' is a leading Unicode surrogate code point.
+    % True iff the character is a Unicode leading surrogate code point.
      % A leading surrogate code point is in the inclusive range from
      % 0xd800 to 0xdbff.
      %
  :- pred is_leading_surrogate(char::in) is semidet.

-    % Succeed if `Char' is a trailing Unicode surrogate code point.
+    % True iff the character is a Unicode trailing surrogate code point.
      % A trailing surrogate code point is in the inclusive range from
      % 0xdc00 to 0xdfff.
      %
  :- pred is_trailing_surrogate(char::in) is semidet.

-    % Succeed if `Char' is a Noncharacter code point.
+    % True iff the character is a Unicode Noncharacter code point.
      % Sixty-six code points are not used to encode characters.
      % These code points should not be used for interchange, but may be used
      % internally.
      %
  :- pred is_noncharacter(char::in) is semidet.

+    % True iff the character is a Unicode Control code point, that is a code
+    % point in General Category `Other,control' (`Cc').
+    %
+:- pred is_control(char::in) is semidet.
+
+    % True iff the character is a Unicode Space Separator code point, that is a
+    % code point in General Category `Separator,space' (`Zs').
+    %
+:- pred is_space_separator(char::in) is semidet.
+
+    % True iff the character is a Unicode Line Separator code point, that is a
+    % code point in General Category `Separator,line' (`Zl').
+    %
+:- pred is_line_separator(char::in) is semidet.
+
+    % True iff the character is a Unicode Paragraph Separator code point, that
+    % is a code point in General Category `Separator,paragraph' (`Zp').
+    %
+:- pred is_paragraph_separator(char::in) is semidet.
+
+    % True iff the character is a Unicode Private-use code point, that is a
+    % code point in General Category `Other,private use' (`Co').
+    %
+:- pred is_private_use(char::in) is semidet.
+
  %---------------------------------------------------------------------------%

      % Convert a char to a pretty_printer.doc for formatting.
@@ -324,7 +350,7 @@
  :- pred int_to_hex_char(int, char).
  :- mode int_to_hex_char(in, out) is semidet.

-    % Succeeds if char is a decimal digit (0-9) or letter (a-z or A-Z).
+    % True iff the character is a decimal digit (0-9) or letter (a-z or A-Z).
      % Returns the character's value as a digit (0-9 or 10-35).
      %
  :- pragma obsolete(digit_to_int/2).
@@ -1019,6 +1045,36 @@ is_noncharacter(Char) :-
      ; Int /\ 0xfffe = 0xfffe
      ).

+is_control(Char) :-
+    Int = char.to_int(Char),
+    ( 0x0000 =< Int, Int =< 0x001f
+    ; 0x007f =< Int, Int =< 0x009f
+    ).
+
+is_space_separator(Char) :-
+    Int = char.to_int(Char),
+    ( Int = 0x0020
+    ; Int = 0x00a0
+    ; Int = 0x1680
+    ; 0x2000 =< Int, Int =< 0x200a
+    ; Int = 0x202f
+    ; Int = 0x205f
+    ; Int = 0x3000
+    ).
+
+is_line_separator(Char) :-
+    0x2028 = char.to_int(Char).
+
+is_paragraph_separator(Char) :-
+    0x2029 = char.to_int(Char).
+
+is_private_use(Char) :-
+    Int = char.to_int(Char),
+    ( 0xe000 =< Int, Int =< 0xf8ff     % Private Use Area.
+    ; 0xf0000 =< Int, Int =< 0xffffd   % Supplementary Private Use Area-A.
+    ; 0x100000 =< Int, Int =< 0x10fffd % Supplementary Private Use Area-B.
+    ).
+
  char_to_doc(C) = str(term_io.quoted_char(C)).

  %---------------------------------------------------------------------------%

