<div dir="ltr"><div>Hi Peter,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 16, 2019 at 5:04 PM Peter Wang &lt;<a href="mailto:novalazy@gmail.com">novalazy@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">library/string.m:<br>
Define string.to_char_list and string.to_rev_char_list to either<br>
replace code units in ill-formed sequences with U+FFFD or return<br>
unpaired surrogate code points.<br>
<br>
Use Mercury version of do_to_char_list instead of updating<br>
the foreign language implementations.<br></blockquote><div><br></div><div>I agree with this.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
tests/hard_coded/Mmakefile:<br>
tests/hard_coded/string_char_list_ilseq.exp:<br>
tests/hard_coded/string_char_list_ilseq.exp2:<br>
tests/hard_coded/string_char_list_ilseq.m:<br>
Add test case.<br>
---<br>
library/string.m | 123 ++++---------------<br>
tests/hard_coded/Mmakefile | 1 +<br>
tests/hard_coded/string_char_list_ilseq.exp | 5 +<br>
tests/hard_coded/string_char_list_ilseq.exp2 | 5 +<br>
tests/hard_coded/string_char_list_ilseq.m | 50 ++++++++<br>
5 files changed, 88 insertions(+), 96 deletions(-)<br>
create mode 100644 tests/hard_coded/string_char_list_ilseq.exp<br>
create mode 100644 tests/hard_coded/string_char_list_ilseq.exp2<br>
create mode 100644 tests/hard_coded/string_char_list_ilseq.m<br>
<br>
diff --git a/library/string.m b/library/string.m<br>
index 1b0d32e87..0501a3198 100644<br>
--- a/library/string.m<br>
+++ b/library/string.m<br>
@@ -124,9 +124,15 @@<br>
%<br>
<br>
% Convert the string to a list of characters (code points).<br>
+ %<br>
+ % In the forward mode:<br>
+ % If strings use UTF-8 encoding then each code unit in an ill-formed<br>
+ % sequence is replaced by U+FFFD REPLACEMENT CHARACTER in the list.<br>
+ % If strings use UTF-16 encoding then each unpaired surrogate code point<br>
+ % is returned as a separate code point in the list.<br>
+ %<br>
% The reverse mode of the predicate throws an exception if<br>
% the list of characters contains a null character.<br>
- %<br>
% NOTE: In the future we may also throw an exception if the list contains<br>
% a surrogate code point.<br>
%<br>
@@ -135,10 +141,17 @@<br>
:- mode to_char_list(in, out) is det.<br>
:- mode to_char_list(uo, in) is det.<br></blockquote><div><br></div><div>Hmm, the reverse mode is not det: in the forward mode U+FFFD corresponds to every possible ill-formed sequence (as well as to the correctly encoded replacement character), so a list containing U+FFFD does not determine a unique string. The existing implementation is similarly incorrect. I would suggest leaving these as is and defining new functions to convert to/from char lists (in addition to the ones you proposed to convert to char_or_code_unit).</div><div><br></div><div>The actual code is fine, as is the test case.</div><div><br></div><div>Mark</div><div> </div></div></div>
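<div>A rough illustration of the point about the reverse mode, sketched in Python rather than Mercury. The helper named to_char_list here is a hypothetical stand-in, not the library predicate, and it relies on Python's own UTF-8 decoder with errors="replace". For a single stray byte this behaves like the rule in the patch (one U+FFFD per ill-formed code unit), although Python's handling of longer ill-formed sequences may differ.</div>
<pre>
```python
# Illustration only: Python's UTF-8 decoder stands in for the forward
# mode of string.to_char_list; this is not the Mercury implementation.
def to_char_list(code_units):
    """Decode UTF-8 bytes, substituting U+FFFD for ill-formed input."""
    return list(code_units.decode("utf-8", errors="replace"))


# One well-formed encoding of U+FFFD and two different ill-formed inputs.
well_formed = "\ufffd".encode("utf-8")
ill_formed_a = b"\xff"
ill_formed_b = b"\xfe"

results = [to_char_list(s) for s in (well_formed, ill_formed_a, ill_formed_b)]

# All three distinct inputs produce the same one-element char list, so the
# list alone cannot identify which string it came from.
all_same = all(r == ["\ufffd"] for r in results)
print(all_same)
```
</pre>
<div>Since three distinct strings give the same list, a mapping from the list back to a string has to pick one of them arbitrarily, which is why the (uo, in) mode is not really a deterministic inverse of the (in, out) mode.</div>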