[m-rev.] for review: add more character classification predicates
Julien Fischer
jfischer at opturion.com
Mon Jul 24 21:51:20 AEST 2017
For review by anyone.
I've found myself needing these often enough recently, while cleaning
inputs etc., that they may as well be in the standard library.
---------------------
Add more character classification predicates.
library/char.m:
Add the predicates is_control/1, is_space_separator/1,
is_paragraph_separator/1, is_line_separator/1 and is_private_use/1.
Reword some documentation from "Succeed if ..." to "True iff ...".
This is more consistent with the other documentation in this module.
NEWS:
Announce the additions.
Julien.
diff --git a/NEWS b/NEWS
index 568dea54a..940aa3c78 100644
--- a/NEWS
+++ b/NEWS
@@ -278,6 +278,8 @@ Changes to the Mercury standard library:
- hex_digit_to_int/2, det_hex_digit_to_int/1
- base_digit_to_int/3, det_base_digit_to_int/2
- is_leading_surrogate/1, is_trailing_surrogate/1
+ - is_control/1, is_space_separator/1, is_paragraph_separator/1
+ - is_line_separator/1, is_private_use/1
The following predicates in the char module have been deprecated and will
either be removed or have their semantics changed in a future release:
diff --git a/library/char.m b/library/char.m
index 3431bf5e6..b200b7f17 100644
--- a/library/char.m
+++ b/library/char.m
@@ -276,31 +276,57 @@
%
:- pred to_utf16(char::in, list(int)::out) is semidet.
- % Succeed if `Char' is a Unicode surrogate code point.
- % In UTF-16, a code point with a scalar value greater than 0xffff
- % is encoded with a pair of surrogate code points.
+ % True iff the character is a Unicode Surrogate code point, that is a code
+ % point in General Category `Other,surrogate' (`Cs').
+ % In UTF-16, a code point with a scalar value greater than 0xffff is
+ % encoded with a pair of surrogate code points.
%
:- pred is_surrogate(char::in) is semidet.
- % Succeed if `Char' is a leading Unicode surrogate code point.
+ % True iff the character is a Unicode leading surrogate code point.
% A leading surrogate code point is in the inclusive range from
% 0xd800 to 0xdbff.
%
:- pred is_leading_surrogate(char::in) is semidet.
- % Succeed if `Char' is a trailing Unicode surrogate code point.
+ % True iff the character is a Unicode trailing surrogate code point.
% A trailing surrogate code point is in the inclusive range from
% 0xdc00 to 0xdfff.
%
:- pred is_trailing_surrogate(char::in) is semidet.
- % Succeed if `Char' is a Noncharacter code point.
+ % True iff the character is a Unicode Noncharacter code point.
% Sixty-six code points are not used to encode characters.
% These code points should not be used for interchange, but may be used
% internally.
%
:- pred is_noncharacter(char::in) is semidet.
+ % True iff the character is a Unicode Control code point, that is a code
+ % point in General Category `Other,control' (`Cc').
+ %
+:- pred is_control(char::in) is semidet.
+
+ % True iff the character is a Unicode Space Separator code point, that is a
+ % code point in General Category `Separator,space' (`Zs').
+ %
+:- pred is_space_separator(char::in) is semidet.
+
+ % True iff the character is a Unicode Line Separator code point, that is a
+ % code point in General Category `Separator,line' (`Zl').
+ %
+:- pred is_line_separator(char::in) is semidet.
+
+ % True iff the character is a Unicode Paragraph Separator code point, that
+ % is a code point in General Category `Separator,paragraph' (`Zp').
+ %
+:- pred is_paragraph_separator(char::in) is semidet.
+
+ % True iff the character is a Unicode Private-use code point, that is a
+ % code point in General Category `Other,private use' (`Co').
+ %
+:- pred is_private_use(char::in) is semidet.
+
%---------------------------------------------------------------------------%
% Convert a char to a pretty_printer.doc for formatting.
@@ -324,7 +350,7 @@
:- pred int_to_hex_char(int, char).
:- mode int_to_hex_char(in, out) is semidet.
- % Succeeds if char is a decimal digit (0-9) or letter (a-z or A-Z).
+ % True iff the character is a decimal digit (0-9) or letter (a-z or A-Z).
% Returns the character's value as a digit (0-9 or 10-35).
%
:- pragma obsolete(digit_to_int/2).
@@ -1019,6 +1045,36 @@ is_noncharacter(Char) :-
; Int /\ 0xfffe = 0xfffe
).
+is_control(Char) :-
+ Int = char.to_int(Char),
+ ( 0x0000 =< Int, Int =< 0x001f
+ ; 0x007f =< Int, Int =< 0x009f
+ ).
+
+is_space_separator(Char) :-
+ Int = char.to_int(Char),
+ ( Int = 0x0020
+ ; Int = 0x00a0
+ ; Int = 0x1680
+ ; 0x2000 =< Int, Int =< 0x200a
+ ; Int = 0x202f
+ ; Int = 0x205f
+ ; Int = 0x3000
+ ).
+
+is_line_separator(Char) :-
+ 0x2028 = char.to_int(Char).
+
+is_paragraph_separator(Char) :-
+ 0x2029 = char.to_int(Char).
+
+is_private_use(Char) :-
+ Int = char.to_int(Char),
+ ( 0xe000 =< Int, Int =< 0xf8ff % Private Use Area.
+ ; 0xf0000 =< Int, Int =< 0xffffd % Supplemental Private Use Area-A.
+ ; 0x100000 =< Int, Int =< 0x10fffd % Supplemental Private Use Area-B.
+ ).
+
char_to_doc(C) = str(term_io.quoted_char(C)).
%---------------------------------------------------------------------------%
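
For reviewers who want to see the kind of input cleaning these predicates are meant for, here is a minimal sketch. It assumes this patch has been applied so that char.is_control/1 is available. The module name strip_controls and the helper strip_control_chars/1 are hypothetical and are not part of the patch.

```mercury
:- module strip_controls.
:- interface.

:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.

:- import_module char.
:- import_module list.
:- import_module string.

    % Hypothetical helper, not part of the patch: return the input string
    % with every code point in General Category `Cc' removed.
    %
:- func strip_control_chars(string) = string.

strip_control_chars(S) =
    string.from_char_list(
        % negated_filter keeps the elements for which the test fails,
        % i.e. the characters that are NOT control characters.
        list.negated_filter(char.is_control, string.to_char_list(S))).

main(!IO) :-
    % The tab and the newline are both in the range 0x0000 to 0x001f,
    % so is_control/1 is true for them and they are dropped.
    io.write_string(strip_control_chars("a\tb\n"), !IO),
    io.nl(!IO).
```

The same pattern should work with the other new predicates, for example passing char.is_space_separator to list.filter to pick out the `Zs' characters in a string.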