[m-rev.] for review: Check for surrogates when converting list of char to string.
Mark Brown
mark at mercurylang.org
Tue Oct 29 18:59:45 AEDT 2019
This looks fine.
On Tue, Oct 29, 2019 at 12:02 PM Peter Wang <novalazy at gmail.com> wrote:
> library/string.m:
> Make from_char_list, from_rev_char_list, to_char_list throw an
> exception if the list of chars includes a surrogate code point that
> cannot be encoded in a UTF-8 string.
>
> Make semidet_from_char_list, semidet_from_rev_char_list,
> to_char_list fail if the list of chars includes a surrogate code
> point that cannot be encoded in a UTF-8 string.
>
> runtime/mercury_string.h:
> Document return value of MR_utf8_width.
>
> tests/hard_coded/Mmakefile:
> tests/hard_coded/string_from_char_list_ilseq.exp:
> tests/hard_coded/string_from_char_list_ilseq.exp2:
> tests/hard_coded/string_from_char_list_ilseq.m:
> Add test case.
>
> tests/hard_coded/null_char.exp:
> Expect new message in exceptions thrown by from_char_list,
> from_rev_char_list.
>
> tests/hard_coded/string_hash.m:
> Don't generate surrogate code points in random strings.
> ---
> library/string.m | 162 +++++++++---------
> runtime/mercury_string.h | 1 +
> tests/hard_coded/Mmakefile | 1 +
> tests/hard_coded/null_char.exp | 4 +-
> .../string_from_char_list_ilseq.exp | 15 ++
> .../string_from_char_list_ilseq.exp2 | 15 ++
> .../hard_coded/string_from_char_list_ilseq.m | 88 ++++++++++
> tests/hard_coded/string_hash.m | 7 +-
> 8 files changed, 211 insertions(+), 82 deletions(-)
> create mode 100644 tests/hard_coded/string_from_char_list_ilseq.exp
> create mode 100644 tests/hard_coded/string_from_char_list_ilseq.exp2
> create mode 100644 tests/hard_coded/string_from_char_list_ilseq.m
>
> diff --git a/library/string.m b/library/string.m
> index ab19974d2..cbdbcd128 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -131,10 +131,10 @@
> % If strings use UTF-16 encoding then each unpaired surrogate code
> point
> % is returned as a separate code point in the list.
> %
> - % The reverse mode of the predicate throws an exception if
> - % the list of characters contains a null character.
> - % NOTE: In the future we may also throw an exception if the list
> contains
> - % a surrogate code point.
> + % The reverse mode of the predicate throws an exception if the list
> + % contains a null character or code point that cannot be encoded in a
> + % string (namely, surrogate code points cannot be encoded in UTF-8
> + % strings).
> %
> % The reverse mode of to_char_list/2 is deprecated because the implied
> % ability to round trip convert a string to a list then back to the
> same
> @@ -155,10 +155,10 @@
> % If strings use UTF-16 encoding then each unpaired surrogate code
> point
> % is returned as a separate code point in the list.
> %
> - % The reverse mode of the predicate throws an exception if
> - % the list of characters contains a null character.
> - % NOTE: In the future we may also throw an exception if the list
> contains
> - % a surrogate code point.
> + % The reverse mode of the predicate throws an exception if the list
> + % contains a null character or code point that cannot be encoded in a
> + % string (namely, surrogate code points cannot be encoded in UTF-8
> + % strings).
> %
> % The reverse mode of to_rev_char_list/2 is deprecated because the
> implied
> % ability to round trip convert a string to a list then back to the
> same
> @@ -171,10 +171,9 @@
> :- mode to_rev_char_list(uo, in) is det.
>
> % Convert a list of characters (code points) to a string.
> - % Throws an exception if the list of characters contains a null
> character.
> - %
> - % NOTE: In the future we may also throw an exception if the list
> contains
> - % a surrogate code point.
> + % Throws an exception if the list contains a null character or code
> point
> + % that cannot be encoded in a string (namely, surrogate code points
> cannot
> + % be encoded in UTF-8 strings).
> %
> % The forward mode of from_char_list/2 is deprecated because the
> implied
> % ability to round trip convert a string to a list then back to the
> same
> @@ -187,28 +186,21 @@
> :- mode from_char_list(out, in) is det.
>
> % As above, but fail instead of throwing an exception if the list
> contains
> - % a null character.
> - %
> - % NOTE: In the future we may also throw an exception if the list
> contains
> - % a surrogate code point.
> + % a null character or code point that cannot be encoded in a string.
> %
> :- pred semidet_from_char_list(list(char)::in, string::uo) is semidet.
>
> % Same as from_char_list, except that it reverses the order
> % of the characters.
> - % Throws an exception if the list of characters contains a null
> character.
> - %
> - % NOTE: In the future we may also throw an exception if the list
> contains
> - % a surrogate code point.
> + % Throws an exception if the list contains a null character or code
> point
> + % that cannot be encoded in a string (namely, surrogate code points
> cannot
> + % be encoded in UTF-8 strings).
> %
> :- func from_rev_char_list(list(char)::in) = (string::uo) is det.
> :- pred from_rev_char_list(list(char)::in, string::uo) is det.
>
> % As above, but fail instead of throwing an exception if the list
> contains
> - % a null character.
> - %
> - % NOTE: In the future we may also throw an exception if the list
> contains
> - % a surrogate code point.
> + % a null character or code point that cannot be encoded in a string.
> %
> :- pred semidet_from_rev_char_list(list(char)::in, string::uo) is semidet.
>
> @@ -1696,7 +1688,7 @@ from_char_list(Chars::in, Str::uo) :-
> ( if semidet_from_char_list(Chars, Str0) then
> Str = Str0
> else
> - unexpected($pred, "null character in list")
> + unexpected($pred, "null character or surrogate code point in
> list")
> ).
>
> :- pragma foreign_proc("C",
> @@ -1704,51 +1696,55 @@ from_char_list(Chars::in, Str::uo) :-
> [will_not_call_mercury, promise_pure, thread_safe,
> will_not_modify_trail,
> does_not_affect_liveness, may_not_duplicate, no_sharing],
> "{
> - // mode (uo, in) is det
> MR_Word char_list_ptr;
> size_t size;
> - MR_Char c;
>
> // Loop to calculate list length + sizeof(MR_Word) in `size'
> // using list in `char_list_ptr'.
> - size = 0;
> - char_list_ptr = CharList;
> - while (! MR_list_is_empty(char_list_ptr)) {
> - c = (MR_Char) MR_list_head(char_list_ptr);
> - if (MR_is_ascii(c)) {
> - size++;
> - } else {
> - // XXX ILSEQ Do something if c is a surrogate code point.
> - size += MR_utf8_width(c);
> - }
> - char_list_ptr = MR_list_tail(char_list_ptr);
> - }
> -
> - // Allocate heap space for string.
> - MR_allocate_aligned_string_msg(Str, size, MR_ALLOC_ID);
> -
> - // Loop to copy the characters from the char_list to the string.
> SUCCESS_INDICATOR = MR_TRUE;
> size = 0;
> char_list_ptr = CharList;
> while (! MR_list_is_empty(char_list_ptr)) {
> - c = (MR_Char) MR_list_head(char_list_ptr);
> - // It is an error to put a null character in a string
> - // (see the comments at the top of this file).
> + MR_Char c = (MR_Char) MR_list_head(char_list_ptr);
> if (c == '\\0') {
> SUCCESS_INDICATOR = MR_FALSE;
> break;
> }
> if (MR_is_ascii(c)) {
> - Str[size] = c;
> size++;
> } else {
> - size += MR_utf8_encode(Str + size, c);
> + size_t csize = MR_utf8_width(c);
> + if (csize == 0) {
> + // c is a surrogate code point (or even out of range,
> + // but that is not supposed to happen).
> + SUCCESS_INDICATOR = MR_FALSE;
> + break;
> + }
> + size += csize;
> }
> char_list_ptr = MR_list_tail(char_list_ptr);
> }
>
> - Str[size] = '\\0';
> + if (SUCCESS_INDICATOR) {
> + // Allocate heap space for string.
> + MR_allocate_aligned_string_msg(Str, size, MR_ALLOC_ID);
> +
> + // Loop to copy the characters from the char_list to the string.
> + size = 0;
> + char_list_ptr = CharList;
> + while (! MR_list_is_empty(char_list_ptr)) {
> + MR_Char c = (MR_Char) MR_list_head(char_list_ptr);
> + if (MR_is_ascii(c)) {
> + Str[size] = c;
> + size++;
> + } else {
> + size += MR_utf8_encode(Str + size, c);
> + }
> + char_list_ptr = MR_list_tail(char_list_ptr);
> + }
> +
> + Str[size] = '\\0';
> + }
> }").
> :- pragma foreign_proc("C#",
> semidet_from_char_list(CharList::in, Str::uo),
> @@ -1808,6 +1804,7 @@ semidet_from_char_list(CharList, Str) :-
> ;
> CharList = [C | Cs],
> not char.to_int(C, 0),
> + internal_encoding_is_utf8 => not char.is_surrogate(C),
> semidet_from_char_list(Cs, Str0),
> first_char(Str, C, Str0)
> ).
> @@ -1827,7 +1824,7 @@ from_rev_char_list(Chars, Str) :-
> ( if semidet_from_rev_char_list(Chars, Str0) then
> Str = Str0
> else
> - unexpected($pred, "null character in list")
> + unexpected($pred, "null character or surrogate code point in
> list")
> ).
>
> :- pragma foreign_proc("C",
> @@ -1837,48 +1834,55 @@ from_rev_char_list(Chars, Str) :-
> "{
> MR_Word list_ptr;
> MR_Word size;
> - MR_Char c;
>
> // Loop to calculate list length in `size' using list in `list_ptr'.
> + SUCCESS_INDICATOR = MR_TRUE;
> size = 0;
> list_ptr = Chars;
> while (!MR_list_is_empty(list_ptr)) {
> - c = (MR_Char) MR_list_head(list_ptr);
> - if (MR_is_ascii(c)) {
> - size++;
> - } else {
> - // XXX ILSEQ Do something if c is a surrogate code point.
> - size += MR_utf8_width(c);
> - }
> - list_ptr = MR_list_tail(list_ptr);
> - }
> -
> - // Allocate heap space for string.
> - MR_allocate_aligned_string_msg(Str, size, MR_ALLOC_ID);
> -
> - // Set size to be the offset of the end of the string
> - // (ie the \\0) and null terminate the string.
> - Str[size] = '\\0';
> -
> - // Loop to copy the characters from the list_ptr to the string
> - // in reverse order.
> - list_ptr = Chars;
> - SUCCESS_INDICATOR = MR_TRUE;
> - while (!MR_list_is_empty(list_ptr)) {
> - c = (MR_Char) MR_list_head(list_ptr);
> + MR_Char c = (MR_Char) MR_list_head(list_ptr);
> if (c == '\\0') {
> SUCCESS_INDICATOR = MR_FALSE;
> break;
> }
> if (MR_is_ascii(c)) {
> - size--;
> - Str[size] = c;
> + size++;
> } else {
> - size -= MR_utf8_width(c);
> - MR_utf8_encode(Str + size, c);
> + size_t csize = MR_utf8_width(c);
> + if (csize == 0) {
> + // c is a surrogate code point (or even out of range,
> + // but that is not supposed to happen).
> + SUCCESS_INDICATOR = MR_FALSE;
> + break;
> + }
> + size += csize;
> }
> list_ptr = MR_list_tail(list_ptr);
> }
> +
> + if (SUCCESS_INDICATOR) {
> + // Allocate heap space for string.
> + MR_allocate_aligned_string_msg(Str, size, MR_ALLOC_ID);
> +
> + // Set size to be the offset of the end of the string
> + // (ie the \\0) and null terminate the string.
> + Str[size] = '\\0';
> +
> + // Loop to copy the characters from the list_ptr to the string
> + // in reverse order.
> + list_ptr = Chars;
> + while (! MR_list_is_empty(list_ptr)) {
> + MR_Char c = (MR_Char) MR_list_head(list_ptr);
> + if (MR_is_ascii(c)) {
> + size--;
> + Str[size] = c;
> + } else {
> + size -= MR_utf8_width(c);
> + MR_utf8_encode(Str + size, c);
> + }
> + list_ptr = MR_list_tail(list_ptr);
> + }
> + }
> }").
> :- pragma foreign_proc("C#",
> semidet_from_rev_char_list(Chars::in, Str::uo),
> diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h
> index d3cb00262..a4dcf7c3c 100644
> --- a/runtime/mercury_string.h
> +++ b/runtime/mercury_string.h
> @@ -485,6 +485,7 @@ extern MR_int_least32_t MR_utf8_get_next_mb(const
> MR_String s,
> extern MR_int_least32_t MR_utf8_prev_get(const MR_String s, MR_Integer
> *pos);
>
> // Return the number of bytes required to encode the code point `c' in
> UTF-8.
> +// Returns 0 for surrogate code points or illegal values.
>
> extern size_t MR_utf8_width(MR_Char c);
>
> diff --git a/tests/hard_coded/Mmakefile b/tests/hard_coded/Mmakefile
> index ee9f51179..47e8ec5a8 100644
> --- a/tests/hard_coded/Mmakefile
> +++ b/tests/hard_coded/Mmakefile
> @@ -365,6 +365,7 @@ ORDINARY_PROGS = \
> string_count_codepoints_ilseq \
> string_first_char \
> string_fold_ilseq \
> + string_from_char_list_ilseq \
> string_index_ilseq \
> string_index_next_ilseq \
> string_loop \
> diff --git a/tests/hard_coded/null_char.exp
> b/tests/hard_coded/null_char.exp
> index 7e03f6907..842f8aa9e 100644
> --- a/tests/hard_coded/null_char.exp
> +++ b/tests/hard_coded/null_char.exp
> @@ -1,6 +1,6 @@
> All tests should result in exceptions being thrown.
> -exception(univ_cons(software_error("predicate `string.from_char_list\'/2:
> Unexpected: null character in list")))
> -exception(univ_cons(software_error("predicate
> `string.from_rev_char_list\'/2: Unexpected: null character in list")))
> +exception(univ_cons(software_error("predicate `string.from_char_list\'/2:
> Unexpected: null character or surrogate code point in list")))
> +exception(univ_cons(software_error("predicate
> `string.from_rev_char_list\'/2: Unexpected: null character or surrogate
> code point in list")))
> exception(univ_cons(software_error("predicate `string.set_char\'/4:
> Unexpected: null character")))
> exception(univ_cons(software_error("predicate
> `string.unsafe_set_char\'/4: Unexpected: null character")))
> error("", io_error("null character in input"))
> diff --git a/tests/hard_coded/string_from_char_list_ilseq.exp
> b/tests/hard_coded/string_from_char_list_ilseq.exp
> new file mode 100644
> index 000000000..73f24d867
> --- /dev/null
> +++ b/tests/hard_coded/string_from_char_list_ilseq.exp
> @@ -0,0 +1,15 @@
> +semidet_from_char_list:
> +a b
> +semidet_from_char_list failed
> +semidet_from_char_list failed
> +semidet_from_char_list failed
> +semidet_from_char_list failed
> +semidet_from_char_list failed
> +
> +semidet_from_rev_char_list:
> +b a
> +semidet_from_rev_char_list failed
> +semidet_from_rev_char_list failed
> +semidet_from_rev_char_list failed
> +semidet_from_rev_char_list failed
> +semidet_from_rev_char_list failed
> diff --git a/tests/hard_coded/string_from_char_list_ilseq.exp2
> b/tests/hard_coded/string_from_char_list_ilseq.exp2
> new file mode 100644
> index 000000000..c77d3d069
> --- /dev/null
> +++ b/tests/hard_coded/string_from_char_list_ilseq.exp2
> @@ -0,0 +1,15 @@
> +semidet_from_char_list:
> +a b
> +semidet_from_char_list failed
> +a 0xd83d b
> +a 0xde00 b
> +a 😀 b
> +a 0xde00 0xd83d b
> +
> +semidet_from_rev_char_list:
> +b a
> +semidet_from_rev_char_list failed
> +b 0xd83d a
> +b 0xde00 a
> +b 0xde00 0xd83d a
> +b 😀 a
> diff --git a/tests/hard_coded/string_from_char_list_ilseq.m
> b/tests/hard_coded/string_from_char_list_ilseq.m
> new file mode 100644
> index 000000000..740f256b3
> --- /dev/null
> +++ b/tests/hard_coded/string_from_char_list_ilseq.m
> @@ -0,0 +1,88 @@
>
> +%---------------------------------------------------------------------------%
> +% vim: ts=4 sw=4 et ft=mercury
>
> +%---------------------------------------------------------------------------%
> +%
> +% The .exp file is for backends using UTF-8 string encoding.
> +% The .exp2 file is for backends using UTF-16 string encoding.
> +%
>
> +%---------------------------------------------------------------------------%
> +
> +:- module string_from_char_list_ilseq.
> +:- interface.
> +
> +:- import_module io.
> +
> +:- pred main(io::di, io::uo) is det.
> +
>
> +%---------------------------------------------------------------------------%
> +
> +:- implementation.
> +
> +:- import_module char.
> +:- import_module list.
> +:- import_module string.
> +
>
> +%---------------------------------------------------------------------------%
> +
> +main(!IO) :-
> + io.write_string("semidet_from_char_list:\n", !IO),
> + list.foldl(test_from_char_list, test_cases, !IO),
> +
> + io.write_string("\nsemidet_from_rev_char_list:\n", !IO),
> + list.foldl(test_from_rev_char_list, test_cases, !IO).
> +
> +:- func test_cases = list(list(char)).
> +
> +test_cases = [
> + ['a', 'b'],
> + ['a', char.det_from_int(0), 'b'],
> + ['a', char.det_from_int(0xd83d), 'b'],
> + ['a', char.det_from_int(0xde00), 'b'],
> + ['a', char.det_from_int(0xd83d), char.det_from_int(0xde00), 'b'],
> + ['a', char.det_from_int(0xde00), char.det_from_int(0xd83d), 'b']
> +].
> +
> +:- pred test_from_char_list(list(char)::in, io::di, io::uo) is det.
> +
> +test_from_char_list(Chars, !IO) :-
> + ( if semidet_from_char_list(Chars, Str) then
> + write_string_debug(Str, !IO),
> + io.nl(!IO)
> + else
> + io.write_string("semidet_from_char_list failed\n", !IO)
> + ).
> +
> +:- pred test_from_rev_char_list(list(char)::in, io::di, io::uo) is det.
> +
> +test_from_rev_char_list(Chars, !IO) :-
> + ( if semidet_from_rev_char_list(Chars, Str) then
> + write_string_debug(Str, !IO),
> + io.nl(!IO)
> + else
> + io.write_string("semidet_from_rev_char_list failed\n", !IO)
> + ).
> +
> +:- pred write_string_debug(string::in, io::di, io::uo) is det.
> +
> +write_string_debug(S, !IO) :-
> + write_string_debug_loop(S, 0, !IO).
> +
> +:- pred write_string_debug_loop(string::in, int::in, io::di, io::uo) is
> det.
> +
> +write_string_debug_loop(S, Index, !IO) :-
> + ( if string.index_next(S, Index, NextIndex, Char) then
> + ( if is_surrogate(Char) then
> + write_hex(char.to_int(Char), !IO)
> + else
> + io.write_char(Char, !IO)
> + ),
> + io.write_char(' ', !IO),
> + write_string_debug_loop(S, NextIndex, !IO)
> + else
> + true
> + ).
> +
> +:- pred write_hex(int::in, io::di, io::uo) is det.
> +
> +write_hex(I, !IO) :-
> + io.format("%#x", [i(I)], !IO).
> diff --git a/tests/hard_coded/string_hash.m
> b/tests/hard_coded/string_hash.m
> index 1e8afe68b..5d534935b 100644
> --- a/tests/hard_coded/string_hash.m
> +++ b/tests/hard_coded/string_hash.m
> @@ -93,7 +93,12 @@ rand_char(Char, !RS) :-
> random.random(Rand, !RS),
> % U+0001..U+10ffff (avoid null character).
> Int = 1 + (Rand `mod` char.max_char_value),
> - char.det_from_int(Int, Char).
> + char.det_from_int(Int, Char0),
> + ( if is_surrogate(Char0) then
> + rand_char(Char, !RS)
> + else
> + Char = Char0
> + ).
>
> :- pred test_hash(string::in, int::in, int::in, string::in,
> bool::in, bool::out, io::di, io::uo) is det.
> --
> 2.23.0
>
> _______________________________________________
> reviews mailing list
> reviews at lists.mercurylang.org
> https://lists.mercurylang.org/listinfo/reviews
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20191029/2f94cd2f/attachment-0001.html>
More information about the reviews
mailing list