<div dir="ltr">This looks fine.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Oct 29, 2019 at 12:02 PM Peter Wang <<a href="mailto:novalazy@gmail.com">novalazy@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">library/string.m:<br>
Make from_char_list, from_rev_char_list, to_char_list throw an<br>
exception if the list of chars includes a surrogate code point that<br>
cannot be encoded in a UTF-8 string.<br>
<br>
Make semidet_from_char_list, semidet_from_rev_char_list,<br>
to_char_list fail if the list of chars includes a surrogate code<br>
point that cannot be encoded in a UTF-8 string.<br>
<br>
runtime/mercury_string.h:<br>
Document return value of MR_utf8_width.<br>
<br>
tests/hard_coded/Mmakefile:<br>
tests/hard_coded/string_from_char_list_ilseq.exp:<br>
tests/hard_coded/string_from_char_list_ilseq.exp2:<br>
tests/hard_coded/string_from_char_list_ilseq.m:<br>
Add test case.<br>
<br>
tests/hard_coded/null_char.exp:<br>
Expect new message in exceptions thrown by from_char_list,<br>
from_rev_char_list.<br>
<br>
tests/hard_coded/string_hash.m:<br>
Don't generate surrogate code points in random strings.<br>
---<br>
library/string.m | 162 +++++++++---------<br>
runtime/mercury_string.h | 1 +<br>
tests/hard_coded/Mmakefile | 1 +<br>
tests/hard_coded/null_char.exp | 4 +-<br>
.../string_from_char_list_ilseq.exp | 15 ++<br>
.../string_from_char_list_ilseq.exp2 | 15 ++<br>
.../hard_coded/string_from_char_list_ilseq.m | 88 ++++++++++<br>
tests/hard_coded/string_hash.m | 7 +-<br>
8 files changed, 211 insertions(+), 82 deletions(-)<br>
create mode 100644 tests/hard_coded/string_from_char_list_ilseq.exp<br>
create mode 100644 tests/hard_coded/string_from_char_list_ilseq.exp2<br>
create mode 100644 tests/hard_coded/string_from_char_list_ilseq.m<br>
<br>
diff --git a/library/string.m b/library/string.m<br>
index ab19974d2..cbdbcd128 100644<br>
--- a/library/string.m<br>
+++ b/library/string.m<br>
@@ -131,10 +131,10 @@<br>
% If strings use UTF-16 encoding then each unpaired surrogate code point<br>
% is returned as a separate code point in the list.<br>
%<br>
- % The reverse mode of the predicate throws an exception if<br>
- % the list of characters contains a null character.<br>
- % NOTE: In the future we may also throw an exception if the list contains<br>
- % a surrogate code point.<br>
+ % The reverse mode of the predicate throws an exception if the list<br>
+ % contains a null character or code point that cannot be encoded in a<br>
+ % string (namely, surrogate code points cannot be encoded in UTF-8<br>
+ % strings).<br>
%<br>
% The reverse mode of to_char_list/2 is deprecated because the implied<br>
% ability to round trip convert a string to a list then back to the same<br>
@@ -155,10 +155,10 @@<br>
% If strings use UTF-16 encoding then each unpaired surrogate code point<br>
% is returned as a separate code point in the list.<br>
%<br>
- % The reverse mode of the predicate throws an exception if<br>
- % the list of characters contains a null character.<br>
- % NOTE: In the future we may also throw an exception if the list contains<br>
- % a surrogate code point.<br>
+ % The reverse mode of the predicate throws an exception if the list<br>
+ % contains a null character or code point that cannot be encoded in a<br>
+ % string (namely, surrogate code points cannot be encoded in UTF-8<br>
+ % strings).<br>
%<br>
% The reverse mode of to_rev_char_list/2 is deprecated because the implied<br>
% ability to round trip convert a string to a list then back to the same<br>
@@ -171,10 +171,9 @@<br>
:- mode to_rev_char_list(uo, in) is det.<br>
<br>
% Convert a list of characters (code points) to a string.<br>
- % Throws an exception if the list of characters contains a null character.<br>
- %<br>
- % NOTE: In the future we may also throw an exception if the list contains<br>
- % a surrogate code point.<br>
+ % Throws an exception if the list contains a null character or code point<br>
+ % that cannot be encoded in a string (namely, surrogate code points cannot<br>
+ % be encoded in UTF-8 strings).<br>
%<br>
% The forward mode of from_char_list/2 is deprecated because the implied<br>
% ability to round trip convert a string to a list then back to the same<br>
@@ -187,28 +186,21 @@<br>
:- mode from_char_list(out, in) is det.<br>
<br>
% As above, but fail instead of throwing an exception if the list contains<br>
- % a null character.<br>
- %<br>
- % NOTE: In the future we may also throw an exception if the list contains<br>
- % a surrogate code point.<br>
+ % a null character or code point that cannot be encoded in a string.<br>
%<br>
:- pred semidet_from_char_list(list(char)::in, string::uo) is semidet.<br>
<br>
% Same as from_char_list, except that it reverses the order<br>
% of the characters.<br>
- % Throws an exception if the list of characters contains a null character.<br>
- %<br>
- % NOTE: In the future we may also throw an exception if the list contains<br>
- % a surrogate code point.<br>
+ % Throws an exception if the list contains a null character or code point<br>
+ % that cannot be encoded in a string (namely, surrogate code points cannot<br>
+ % be encoded in UTF-8 strings).<br>
%<br>
:- func from_rev_char_list(list(char)::in) = (string::uo) is det.<br>
:- pred from_rev_char_list(list(char)::in, string::uo) is det.<br>
<br>
% As above, but fail instead of throwing an exception if the list contains<br>
- % a null character.<br>
- %<br>
- % NOTE: In the future we may also throw an exception if the list contains<br>
- % a surrogate code point.<br>
+ % a null character or code point that cannot be encoded in a string.<br>
%<br>
:- pred semidet_from_rev_char_list(list(char)::in, string::uo) is semidet.<br>
<br>
@@ -1696,7 +1688,7 @@ from_char_list(Chars::in, Str::uo) :-<br>
( if semidet_from_char_list(Chars, Str0) then<br>
Str = Str0<br>
else<br>
- unexpected($pred, "null character in list")<br>
+ unexpected($pred, "null character or surrogate code point in list")<br>
).<br>
<br>
:- pragma foreign_proc("C",<br>
@@ -1704,51 +1696,55 @@ from_char_list(Chars::in, Str::uo) :-<br>
[will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail,<br>
does_not_affect_liveness, may_not_duplicate, no_sharing],<br>
"{<br>
- // mode (uo, in) is det<br>
MR_Word char_list_ptr;<br>
size_t size;<br>
- MR_Char c;<br>
<br>
// Loop to calculate list length + sizeof(MR_Word) in `size'<br>
// using list in `char_list_ptr'.<br>
- size = 0;<br>
- char_list_ptr = CharList;<br>
- while (! MR_list_is_empty(char_list_ptr)) {<br>
- c = (MR_Char) MR_list_head(char_list_ptr);<br>
- if (MR_is_ascii(c)) {<br>
- size++;<br>
- } else {<br>
- // XXX ILSEQ Do something if c is a surrogate code point.<br>
- size += MR_utf8_width(c);<br>
- }<br>
- char_list_ptr = MR_list_tail(char_list_ptr);<br>
- }<br>
-<br>
- // Allocate heap space for string.<br>
- MR_allocate_aligned_string_msg(Str, size, MR_ALLOC_ID);<br>
-<br>
- // Loop to copy the characters from the char_list to the string.<br>
SUCCESS_INDICATOR = MR_TRUE;<br>
size = 0;<br>
char_list_ptr = CharList;<br>
while (! MR_list_is_empty(char_list_ptr)) {<br>
- c = (MR_Char) MR_list_head(char_list_ptr);<br>
- // It is an error to put a null character in a string<br>
- // (see the comments at the top of this file).<br>
+ MR_Char c = (MR_Char) MR_list_head(char_list_ptr);<br>
if (c == '\\0') {<br>
SUCCESS_INDICATOR = MR_FALSE;<br>
break;<br>
}<br>
if (MR_is_ascii(c)) {<br>
- Str[size] = c;<br>
size++;<br>
} else {<br>
- size += MR_utf8_encode(Str + size, c);<br>
+ size_t csize = MR_utf8_width(c);<br>
+ if (csize == 0) {<br>
+ // c is a surrogate code point (or even out of range,<br>
+ // but that is not supposed to happen).<br>
+ SUCCESS_INDICATOR = MR_FALSE;<br>
+ break;<br>
+ }<br>
+ size += csize;<br>
}<br>
char_list_ptr = MR_list_tail(char_list_ptr);<br>
}<br>
<br>
- Str[size] = '\\0';<br>
+ if (SUCCESS_INDICATOR) {<br>
+ // Allocate heap space for string.<br>
+ MR_allocate_aligned_string_msg(Str, size, MR_ALLOC_ID);<br>
+<br>
+ // Loop to copy the characters from the char_list to the string.<br>
+ size = 0;<br>
+ char_list_ptr = CharList;<br>
+ while (! MR_list_is_empty(char_list_ptr)) {<br>
+ MR_Char c = (MR_Char) MR_list_head(char_list_ptr);<br>
+ if (MR_is_ascii(c)) {<br>
+ Str[size] = c;<br>
+ size++;<br>
+ } else {<br>
+ size += MR_utf8_encode(Str + size, c);<br>
+ }<br>
+ char_list_ptr = MR_list_tail(char_list_ptr);<br>
+ }<br>
+<br>
+ Str[size] = '\\0';<br>
+ }<br>
}").<br>
:- pragma foreign_proc("C#",<br>
semidet_from_char_list(CharList::in, Str::uo),<br>
@@ -1808,6 +1804,7 @@ semidet_from_char_list(CharList, Str) :-<br>
;<br>
CharList = [C | Cs],<br>
not char.to_int(C, 0),<br>
+ internal_encoding_is_utf8 => not char.is_surrogate(C),<br>
semidet_from_char_list(Cs, Str0),<br>
first_char(Str, C, Str0)<br>
).<br>
@@ -1827,7 +1824,7 @@ from_rev_char_list(Chars, Str) :-<br>
( if semidet_from_rev_char_list(Chars, Str0) then<br>
Str = Str0<br>
else<br>
- unexpected($pred, "null character in list")<br>
+ unexpected($pred, "null character or surrogate code point in list")<br>
).<br>
<br>
:- pragma foreign_proc("C",<br>
@@ -1837,48 +1834,55 @@ from_rev_char_list(Chars, Str) :-<br>
"{<br>
MR_Word list_ptr;<br>
MR_Word size;<br>
- MR_Char c;<br>
<br>
// Loop to calculate list length in `size' using list in `list_ptr'.<br>
+ SUCCESS_INDICATOR = MR_TRUE;<br>
size = 0;<br>
list_ptr = Chars;<br>
while (!MR_list_is_empty(list_ptr)) {<br>
- c = (MR_Char) MR_list_head(list_ptr);<br>
- if (MR_is_ascii(c)) {<br>
- size++;<br>
- } else {<br>
- // XXX ILSEQ Do something if c is a surrogate code point.<br>
- size += MR_utf8_width(c);<br>
- }<br>
- list_ptr = MR_list_tail(list_ptr);<br>
- }<br>
-<br>
- // Allocate heap space for string.<br>
- MR_allocate_aligned_string_msg(Str, size, MR_ALLOC_ID);<br>
-<br>
- // Set size to be the offset of the end of the string<br>
- // (ie the \\0) and null terminate the string.<br>
- Str[size] = '\\0';<br>
-<br>
- // Loop to copy the characters from the list_ptr to the string<br>
- // in reverse order.<br>
- list_ptr = Chars;<br>
- SUCCESS_INDICATOR = MR_TRUE;<br>
- while (!MR_list_is_empty(list_ptr)) {<br>
- c = (MR_Char) MR_list_head(list_ptr);<br>
+ MR_Char c = (MR_Char) MR_list_head(list_ptr);<br>
if (c == '\\0') {<br>
SUCCESS_INDICATOR = MR_FALSE;<br>
break;<br>
}<br>
if (MR_is_ascii(c)) {<br>
- size--;<br>
- Str[size] = c;<br>
+ size++;<br>
} else {<br>
- size -= MR_utf8_width(c);<br>
- MR_utf8_encode(Str + size, c);<br>
+ size_t csize = MR_utf8_width(c);<br>
+ if (csize == 0) {<br>
+ // c is a surrogate code point (or even out of range,<br>
+ // but that is not supposed to happen).<br>
+ SUCCESS_INDICATOR = MR_FALSE;<br>
+ break;<br>
+ }<br>
+ size += csize;<br>
}<br>
list_ptr = MR_list_tail(list_ptr);<br>
}<br>
+<br>
+ if (SUCCESS_INDICATOR) {<br>
+ // Allocate heap space for string.<br>
+ MR_allocate_aligned_string_msg(Str, size, MR_ALLOC_ID);<br>
+<br>
+ // Set size to be the offset of the end of the string<br>
+ // (ie the \\0) and null terminate the string.<br>
+ Str[size] = '\\0';<br>
+<br>
+ // Loop to copy the characters from the list_ptr to the string<br>
+ // in reverse order.<br>
+ list_ptr = Chars;<br>
+ while (! MR_list_is_empty(list_ptr)) {<br>
+ MR_Char c = (MR_Char) MR_list_head(list_ptr);<br>
+ if (MR_is_ascii(c)) {<br>
+ size--;<br>
+ Str[size] = c;<br>
+ } else {<br>
+ size -= MR_utf8_width(c);<br>
+ MR_utf8_encode(Str + size, c);<br>
+ }<br>
+ list_ptr = MR_list_tail(list_ptr);<br>
+ }<br>
+ }<br>
}").<br>
:- pragma foreign_proc("C#",<br>
semidet_from_rev_char_list(Chars::in, Str::uo),<br>
diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h<br>
index d3cb00262..a4dcf7c3c 100644<br>
--- a/runtime/mercury_string.h<br>
+++ b/runtime/mercury_string.h<br>
@@ -485,6 +485,7 @@ extern MR_int_least32_t MR_utf8_get_next_mb(const MR_String s,<br>
extern MR_int_least32_t MR_utf8_prev_get(const MR_String s, MR_Integer *pos);<br>
<br>
// Return the number of bytes required to encode the code point `c' in UTF-8.<br>
+// Returns 0 for surrogate code points or illegal values.<br>
<br>
extern size_t MR_utf8_width(MR_Char c);<br>
<br>
diff --git a/tests/hard_coded/Mmakefile b/tests/hard_coded/Mmakefile<br>
index ee9f51179..47e8ec5a8 100644<br>
--- a/tests/hard_coded/Mmakefile<br>
+++ b/tests/hard_coded/Mmakefile<br>
@@ -365,6 +365,7 @@ ORDINARY_PROGS = \<br>
string_count_codepoints_ilseq \<br>
string_first_char \<br>
string_fold_ilseq \<br>
+ string_from_char_list_ilseq \<br>
string_index_ilseq \<br>
string_index_next_ilseq \<br>
string_loop \<br>
diff --git a/tests/hard_coded/null_char.exp b/tests/hard_coded/null_char.exp<br>
index 7e03f6907..842f8aa9e 100644<br>
--- a/tests/hard_coded/null_char.exp<br>
+++ b/tests/hard_coded/null_char.exp<br>
@@ -1,6 +1,6 @@<br>
All tests should result in exceptions being thrown.<br>
-exception(univ_cons(software_error("predicate `string.from_char_list\'/2: Unexpected: null character in list")))<br>
-exception(univ_cons(software_error("predicate `string.from_rev_char_list\'/2: Unexpected: null character in list")))<br>
+exception(univ_cons(software_error("predicate `string.from_char_list\'/2: Unexpected: null character or surrogate code point in list")))<br>
+exception(univ_cons(software_error("predicate `string.from_rev_char_list\'/2: Unexpected: null character or surrogate code point in list")))<br>
exception(univ_cons(software_error("predicate `string.set_char\'/4: Unexpected: null character")))<br>
exception(univ_cons(software_error("predicate `string.unsafe_set_char\'/4: Unexpected: null character")))<br>
error("", io_error("null character in input"))<br>
diff --git a/tests/hard_coded/string_from_char_list_ilseq.exp b/tests/hard_coded/string_from_char_list_ilseq.exp<br>
new file mode 100644<br>
index 000000000..73f24d867<br>
--- /dev/null<br>
+++ b/tests/hard_coded/string_from_char_list_ilseq.exp<br>
@@ -0,0 +1,15 @@<br>
+semidet_from_char_list:<br>
+a b <br>
+semidet_from_char_list failed<br>
+semidet_from_char_list failed<br>
+semidet_from_char_list failed<br>
+semidet_from_char_list failed<br>
+semidet_from_char_list failed<br>
+<br>
+semidet_from_rev_char_list:<br>
+b a <br>
+semidet_from_rev_char_list failed<br>
+semidet_from_rev_char_list failed<br>
+semidet_from_rev_char_list failed<br>
+semidet_from_rev_char_list failed<br>
+semidet_from_rev_char_list failed<br>
diff --git a/tests/hard_coded/string_from_char_list_ilseq.exp2 b/tests/hard_coded/string_from_char_list_ilseq.exp2<br>
new file mode 100644<br>
index 000000000..c77d3d069<br>
--- /dev/null<br>
+++ b/tests/hard_coded/string_from_char_list_ilseq.exp2<br>
@@ -0,0 +1,15 @@<br>
+semidet_from_char_list:<br>
+a b <br>
+semidet_from_char_list failed<br>
+a 0xd83d b <br>
+a 0xde00 b <br>
+a 😀 b <br>
+a 0xde00 0xd83d b <br>
+<br>
+semidet_from_rev_char_list:<br>
+b a <br>
+semidet_from_rev_char_list failed<br>
+b 0xd83d a <br>
+b 0xde00 a <br>
+b 0xde00 0xd83d a <br>
+b 😀 a <br>
diff --git a/tests/hard_coded/string_from_char_list_ilseq.m b/tests/hard_coded/string_from_char_list_ilseq.m<br>
new file mode 100644<br>
index 000000000..740f256b3<br>
--- /dev/null<br>
+++ b/tests/hard_coded/string_from_char_list_ilseq.m<br>
@@ -0,0 +1,88 @@<br>
+%---------------------------------------------------------------------------%<br>
+% vim: ts=4 sw=4 et ft=mercury<br>
+%---------------------------------------------------------------------------%<br>
+%<br>
+% The .exp file is for backends using UTF-8 string encoding.<br>
+% The .exp2 file is for backends using UTF-16 string encoding.<br>
+%<br>
+%---------------------------------------------------------------------------%<br>
+<br>
+:- module string_from_char_list_ilseq.<br>
+:- interface.<br>
+<br>
+:- import_module io.<br>
+<br>
+:- pred main(io::di, io::uo) is det.<br>
+<br>
+%---------------------------------------------------------------------------%<br>
+<br>
+:- implementation.<br>
+<br>
+:- import_module char.<br>
+:- import_module list.<br>
+:- import_module string.<br>
+<br>
+%---------------------------------------------------------------------------%<br>
+<br>
+main(!IO) :-<br>
+ io.write_string("semidet_from_char_list:\n", !IO),<br>
+ list.foldl(test_from_char_list, test_cases, !IO),<br>
+<br>
+ io.write_string("\nsemidet_from_rev_char_list:\n", !IO),<br>
+ list.foldl(test_from_rev_char_list, test_cases, !IO).<br>
+<br>
+:- func test_cases = list(list(char)).<br>
+<br>
+test_cases = [<br>
+ ['a', 'b'],<br>
+ ['a', char.det_from_int(0), 'b'],<br>
+ ['a', char.det_from_int(0xd83d), 'b'],<br>
+ ['a', char.det_from_int(0xde00), 'b'],<br>
+ ['a', char.det_from_int(0xd83d), char.det_from_int(0xde00), 'b'],<br>
+ ['a', char.det_from_int(0xde00), char.det_from_int(0xd83d), 'b']<br>
+].<br>
+<br>
+:- pred test_from_char_list(list(char)::in, io::di, io::uo) is det.<br>
+<br>
+test_from_char_list(Chars, !IO) :-<br>
+ ( if semidet_from_char_list(Chars, Str) then<br>
+ write_string_debug(Str, !IO),<br>
+ <a href="http://io.nl" rel="noreferrer" target="_blank">io.nl</a>(!IO)<br>
+ else<br>
+ io.write_string("semidet_from_char_list failed\n", !IO)<br>
+ ).<br>
+<br>
+:- pred test_from_rev_char_list(list(char)::in, io::di, io::uo) is det.<br>
+<br>
+test_from_rev_char_list(Chars, !IO) :-<br>
+ ( if semidet_from_rev_char_list(Chars, Str) then<br>
+ write_string_debug(Str, !IO),<br>
+ <a href="http://io.nl" rel="noreferrer" target="_blank">io.nl</a>(!IO)<br>
+ else<br>
+ io.write_string("semidet_from_rev_char_list failed\n", !IO)<br>
+ ).<br>
+<br>
+:- pred write_string_debug(string::in, io::di, io::uo) is det.<br>
+<br>
+write_string_debug(S, !IO) :-<br>
+ write_string_debug_loop(S, 0, !IO).<br>
+<br>
+:- pred write_string_debug_loop(string::in, int::in, io::di, io::uo) is det.<br>
+<br>
+write_string_debug_loop(S, Index, !IO) :-<br>
+ ( if string.index_next(S, Index, NextIndex, Char) then<br>
+ ( if is_surrogate(Char) then<br>
+ write_hex(char.to_int(Char), !IO)<br>
+ else<br>
+ io.write_char(Char, !IO)<br>
+ ),<br>
+ io.write_char(' ', !IO),<br>
+ write_string_debug_loop(S, NextIndex, !IO)<br>
+ else<br>
+ true<br>
+ ).<br>
+<br>
+:- pred write_hex(int::in, io::di, io::uo) is det.<br>
+<br>
+write_hex(I, !IO) :-<br>
+ io.format("%#x", [i(I)], !IO).<br>
diff --git a/tests/hard_coded/string_hash.m b/tests/hard_coded/string_hash.m<br>
index 1e8afe68b..5d534935b 100644<br>
--- a/tests/hard_coded/string_hash.m<br>
+++ b/tests/hard_coded/string_hash.m<br>
@@ -93,7 +93,12 @@ rand_char(Char, !RS) :-<br>
random.random(Rand, !RS),<br>
% U+0001..U+10ffff (avoid null character).<br>
Int = 1 + (Rand `mod` char.max_char_value),<br>
- char.det_from_int(Int, Char).<br>
+ char.det_from_int(Int, Char0),<br>
+ ( if is_surrogate(Char0) then<br>
+ rand_char(Char, !RS)<br>
+ else<br>
+ Char = Char0<br>
+ ).<br>
<br>
:- pred test_hash(string::in, int::in, int::in, string::in,<br>
bool::in, bool::out, io::di, io::uo) is det.<br>
-- <br>
2.23.0<br>
<br>
_______________________________________________<br>
reviews mailing list<br>
<a href="mailto:reviews@lists.mercurylang.org" target="_blank">reviews@lists.mercurylang.org</a><br>
<a href="https://lists.mercurylang.org/listinfo/reviews" rel="noreferrer" target="_blank">https://lists.mercurylang.org/listinfo/reviews</a><br>
</blockquote></div>