[m-rev.] for review: add more character classification predicates
Peter Wang
novalazy at gmail.com
Tue Jul 25 16:58:11 AEST 2017
On Mon, 24 Jul 2017 21:51:20 +1000 (AEST), Julien Fischer <jfischer at opturion.com> wrote:
>
> For review by anyone.
>
> I've found myself needing these enough times recently while cleaning
> inputs etc that they may as well be in the standard library.
>
> ---------------------
>
> Add more character classification predicates.
>
> library/char.m:
> Add the predicates is_control/1, is_space_separator/1,
> is_paragraph_separator/1, is_line_separator/1 and is_private_use/1.
>
> Reword some documentation from "Succeed if ..." to "True iff ...".
> This is more consistent with the other documentation in this module.
>
> NEWS:
> Announce the additions.
>
> Julien.
>
> diff --git a/NEWS b/NEWS
> index 568dea54a..940aa3c78 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -278,6 +278,8 @@ Changes to the Mercury standard library:
> - hex_digit_to_int/2, det_hex_digit_to_int/1
> - base_digit_to_int/3, det_base_digit_to_int/2
> - is_leading_surrogate/1, is_trailing_surrogate/1
> + - is_control/1, is_space_separator/1, is_paragraph_separator/1
> + - is_line_separator/1, is_private_use/1
>
> The following predicates in the char module have been deprecated and will
> either be removed or have their semantics changed in a future release:
> diff --git a/library/char.m b/library/char.m
> index 3431bf5e6..b200b7f17 100644
> --- a/library/char.m
> +++ b/library/char.m
> @@ -276,31 +276,57 @@
> %
> :- pred to_utf16(char::in, list(int)::out) is semidet.
>
> - % Succeed if `Char' is a Unicode surrogate code point.
> - % In UTF-16, a code point with a scalar value greater than 0xffff
> - % is encoded with a pair of surrogate code points.
> + % True iff the character is a Unicode Surrogate code point, that is a code
> + % point in General Category `Other,surrogate' (`Cs').
> + % In UTF-16, a code point with a scalar value greater than 0xffff is
> + % encoded with a pair of surrogate code points.
> %
> :- pred is_surrogate(char::in) is semidet.
>
> - % Succeed if `Char' is a leading Unicode surrogate code point.
> + % True iff the character is a Unicode leading surrogate code point.
> % A leading surrogate code point is in the inclusive range from
> % 0xd800 to 0xdbff.
> %
> :- pred is_leading_surrogate(char::in) is semidet.
>
> - % Succeed if `Char' is a trailing Unicode surrogate code point.
> + % True iff the character is a Unicode trailing surrogate code point.
> % A trailing surrogate code point is in the inclusive range from
> % 0xdc00 to 0xdfff.
> %
> :- pred is_trailing_surrogate(char::in) is semidet.
>
> - % Succeed if `Char' is a Noncharacter code point.
> + % True iff the character is a Unicode Noncharacter code point.
> % Sixty-six code points are not used to encode characters.
> % These code points should not be used for interchange, but may be used
> % internally.
> %
> :- pred is_noncharacter(char::in) is semidet.
>
> + % True iff the character is a Unicode Control code point, that is a code
> + % point in General Category `Other,control' (`Cc').
> + %
> +:- pred is_control(char::in) is semidet.
> +
> + % True iff the character is a Unicode Space Separator code point, that is a
> + % code point in General Category `Separator,space' (`Zs').
> + %
> +:- pred is_space_separator(char::in) is semidet.
Is this category subject to extension? Maybe mention the Unicode
version.
> +
> + % True iff the character is a Unicode Line Separator code point, that is a
> + % code point in General Category `Separator,line' (`Zl').
> + %
> +:- pred is_line_separator(char::in) is semidet.
> +
> + % True iff the character is a Unicode Paragraph Separator code point, that
> + % is a code point in General Category `Separator,paragraph' (`Zp').
> + %
> +:- pred is_paragraph_separator(char::in) is semidet.
> +
> + % True iff the character is a Unicode Private-use code point, that is a
> + % code point in General Category `Other,private use' (`Co').
> + %
> +:- pred is_private_use(char::in) is semidet.
> +
> %---------------------------------------------------------------------------%
>
> % Convert a char to a pretty_printer.doc for formatting.
> @@ -324,7 +350,7 @@
> :- pred int_to_hex_char(int, char).
> :- mode int_to_hex_char(in, out) is semidet.
>
> - % Succeeds if char is a decimal digit (0-9) or letter (a-z or A-Z).
> + % True iff the characters is a decimal digit (0-9) or letter (a-z or A-Z).
Double space. Also, "characters" should be "character".
> % Returns the character's value as a digit (0-9 or 10-35).
> %
> :- pragma obsolete(digit_to_int/2).
> @@ -1019,6 +1045,36 @@ is_noncharacter(Char) :-
> ; Int /\ 0xfffe = 0xfffe
> ).
>
> +is_control(Char) :-
> + Int = char.to_int(Char),
> + ( 0x0000 =< Int, Int =< 0x001f
> + ; 0x007f =< Int, Int =< 0x009f
> + ).
> +
> +is_space_separator(Char) :-
> + Int = char.to_int(Char),
> + ( Int = 0x0020
> + ; Int = 0x00a0
> + ; Int = 0x1680
> + ; Int =< 0x2000, Int =< 0x200a
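That's not right. The first comparison is reversed: it should be
`0x2000 =< Int, Int =< 0x200a'. As written, the disjunct succeeds for
every code point up to 0x2000.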
> + ; Int = 0x202f
> + ; Int = 0x205f
> + ; Int = 0x3000
> + ).
> +
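Here is a minimal standalone sketch of the clause with the range disjunct
as I assume it was intended (0x2000 to 0x200a inclusive), plus a few spot
checks. The module name zs_check and the names zs/1 and report/4 are made
up for illustration and are not part of the patch. The list of code points
is the one from the patch; I have not checked it against any particular
Unicode version.

```mercury
:- module zs_check.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module char.
:- import_module int.
:- import_module list.
:- import_module string.

    % Hypothetical local copy of is_space_separator/1 from the patch,
    % renamed so that it cannot clash with anything in library/char.m.
    % The only change is the lower bound of the range disjunct.
    %
:- pred zs(char::in) is semidet.

zs(Char) :-
    Int = char.to_int(Char),
    ( Int = 0x0020
    ; Int = 0x00a0
    ; Int = 0x1680
    ; 0x2000 =< Int, Int =< 0x200a
    ; Int = 0x202f
    ; Int = 0x205f
    ; Int = 0x3000
    ).

    % Print whether zs/1 accepts the given character.
    %
:- pred report(string::in, char::in, io::di, io::uo) is det.

report(Label, Char, !IO) :-
    ( if zs(Char) then Result = "yes" else Result = "no" ),
    io.format("%s: %s\n", [s(Label), s(Result)], !IO).

main(!IO) :-
    % An ordinary letter must be rejected; the clause as posted accepts it.
    report("U+0061", 'a', !IO),
    % Both of these are in the list above and should be accepted.
    report("U+0020", ' ', !IO),
    report("U+2003", char.det_from_int(0x2003), !IO).
```

With the bound fixed, the first check should report "no" and the other two
"yes". A test case along these lines in the test suite would have caught
the reversed comparison.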
Peter