[m-rev.] for review: add more character classification predicates

Peter Wang novalazy at gmail.com
Tue Jul 25 16:58:11 AEST 2017


On Mon, 24 Jul 2017 21:51:20 +1000 (AEST), Julien Fischer <jfischer at opturion.com> wrote:
> 
> For review by anyone.
> 
> I've found myself needing these enough times recently while cleaning
> inputs etc that they may as well be in the standard library.
> 
> ---------------------
> 
> Add more character classification predicates.
> 
> library/char.m:
>       Add the predicates is_control/1, is_space_separator/1,
>       is_paragraph_separator/1, is_line_separator/1 and is_private_use/1.
> 
>       Reword some documentation from "Succeed if ..." to "True iff ...".
>       This is more consistent with the other documentation in this module.
> 
> NEWS:
>       Announce the additions.
> 
> Julien.
> 
> diff --git a/NEWS b/NEWS
> index 568dea54a..940aa3c78 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -278,6 +278,8 @@ Changes to the Mercury standard library:
>       - hex_digit_to_int/2, det_hex_digit_to_int/1
>       - base_digit_to_int/3, det_base_digit_to_int/2
>       - is_leading_surrogate/1, is_trailing_surrogate/1
> +    - is_control/1, is_space_separator/1, is_paragraph_separator/1
> +    - is_line_separator/1, is_private_use/1
> 
>     The following predicates in the char module have been deprecated and will
>     either be removed or have their semantics changed in a future release:

> diff --git a/library/char.m b/library/char.m
> index 3431bf5e6..b200b7f17 100644
> --- a/library/char.m
> +++ b/library/char.m
> @@ -276,31 +276,57 @@
>       %
>   :- pred to_utf16(char::in, list(int)::out) is semidet.
> 
> -    % Succeed if `Char' is a Unicode surrogate code point.
> -    % In UTF-16, a code point with a scalar value greater than 0xffff
> -    % is encoded with a pair of surrogate code points.
> +    % True iff the character is a Unicode Surrogate code point, that is a code
> +    % point in General Category `Other,surrogate' (`Cs').
> +    % In UTF-16, a code point with a scalar value greater than 0xffff is
> +    % encoded with a pair of surrogate code points.
>       %
>   :- pred is_surrogate(char::in) is semidet.
> 
> -    % Succeed if `Char' is a leading Unicode surrogate code point.
> +    % True iff the character is a Unicode leading surrogate code point.
>       % A leading surrogate code point is in the inclusive range from
>       % 0xd800 to 0xdbff.
>       %
>   :- pred is_leading_surrogate(char::in) is semidet.
> 
> -    % Succeed if `Char' is a trailing Unicode surrogate code point.
> +    % True iff the character is a Unicode trailing surrogate code point.
>       % A trailing surrogate code point is in the inclusive range from
>       % 0xdc00 to 0xdfff.
>       %
>   :- pred is_trailing_surrogate(char::in) is semidet.
> 

> -    % Succeed if `Char' is a Noncharacter code point.
> +    % True iff the character is a Unicode Noncharacter code point.
>       % Sixty-six code points are not used to encode characters.
>       % These code points should not be used for interchange, but may be used
>       % internally.
>       %
>   :- pred is_noncharacter(char::in) is semidet.
> 

> +    % True iff the character is a Unicode Control code point, that is a code
> +    % point in General Category `Other,control' (`Cc').
> +    %
> +:- pred is_control(char::in) is semidet.
> +
> +    % True iff the character is a Unicode Space Separator code point, that is a
> +    % code point in General Category `Separator,space' (`Zs').
> +    %
> +:- pred is_space_separator(char::in) is semidet.

Is this category subject to extension? Maybe mention the Unicode
version.

> +
> +    % True iff the character  is a Unicode Line Separator code point, that is a
> +    % code point in General Category `Separator,line' (`Zl').
> +    %
> +:- pred is_line_separator(char::in) is semidet.
> +
> +    % True iff the character is a Unicode Paragraph Separator code point, that
> +    % is a code point in General Category `Separator,paragraph' (`Zp').
> +    %
> +:- pred is_paragraph_separator(char::in) is semidet.
> +
> +    % True iff the character is a Unicode Private-use code point, that is a
> +    % code point in General Category `Other,private use' (`Co').
> +    %
> +:- pred is_private_use(char::in) is semidet.
> +
>   %---------------------------------------------------------------------------%
> 
>       % Convert a char to a pretty_printer.doc for formatting.
> @@ -324,7 +350,7 @@
>   :- pred int_to_hex_char(int, char).
>   :- mode int_to_hex_char(in, out) is semidet.
> 
> -    % Succeeds if char is a decimal digit (0-9) or letter (a-z or A-Z).
> +    % True iff the characters  is a decimal digit (0-9) or letter (a-z or A-Z).

Double space, and "characters" should be "character".

>       % Returns the character's value as a digit (0-9 or 10-35).
>       %
>   :- pragma obsolete(digit_to_int/2).
> @@ -1019,6 +1045,36 @@ is_noncharacter(Char) :-
>       ; Int /\ 0xfffe = 0xfffe
>       ).
> 
> +is_control(Char) :-
> +    Int = char.to_int(Char),
> +    ( 0x0000 =< Int, Int =< 0x001f
> +    ; 0x007f =< Int, Int =< 0x009f
> +    ).
> +
> +is_space_separator(Char) :-
> +    Int = char.to_int(Char),
> +    ( Int = 0x0020
> +    ; Int = 0x00a0
> +    ; Int = 0x1680
> +    ; Int =< 0x2000, Int =< 0x200a

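That's not right. The first comparison should be 0x2000 =< Int; as
written, the disjunct succeeds for every code point up to 0x2000.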

> +    ; Int = 0x202f
> +    ; Int = 0x205f
> +    ; Int = 0x3000
> +    ).
> +
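
For illustration, a minimal sketch of the predicate with the range check
corrected, assuming the intended range is U+2000 to U+200A inclusive. The
module name zs_sketch, the predicate name is_space_separator_sketch and the
report/4 helper are made up for this example and are not part of the patch.

```mercury
:- module zs_sketch.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module char.
:- import_module int.
:- import_module list.
:- import_module string.

    % Hypothetical variant of is_space_separator/1 from the patch;
    % only the 0x2000..0x200a disjunct differs.
    %
:- pred is_space_separator_sketch(char::in) is semidet.

is_space_separator_sketch(Char) :-
    Int = char.to_int(Char),
    ( Int = 0x0020
    ; Int = 0x00a0
    ; Int = 0x1680
    ; 0x2000 =< Int, Int =< 0x200a    % lower bound now tested correctly
    ; Int = 0x202f
    ; Int = 0x205f
    ; Int = 0x3000
    ).

    % Illustrative helper: print whether the sketch accepts Char.
    %
:- pred report(string::in, char::in, io::di, io::uo) is det.

report(Label, Char, !IO) :-
    ( if is_space_separator_sketch(Char) then
        Result = "yes"
    else
        Result = "no"
    ),
    io.format("%s: %s\n", [s(Label), s(Result)], !IO).

main(!IO) :-
    % 'A' is 0x41, which the disjunct as posted would have accepted.
    report("U+0041", 'A', !IO),
    report("U+0020", ' ', !IO),
    % U+2003 (EM SPACE) lies inside the corrected range.
    report("U+2003", char.det_from_int(0x2003), !IO).
```

With the comparison as posted, any character below 0x2000 (including
ordinary letters) would be classified as a space separator, so a test
along these lines might be worth adding.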

Peter

