<div dir="ltr"><div>Hi Peter,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 16, 2019 at 5:04 PM Peter Wang <<a href="mailto:novalazy@gmail.com">novalazy@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">library/string.m:<br> Define string.to_char_list and string.to_rev_char_list to either<br> replace code units in ill-formed sequences with U+FFFD or return<br> unpaired surrogate code points.<br> <br> Use Mercury version of do_to_char_list instead of updating<br> the foreign language implementations.<br></blockquote><div><br></div><div>I agree with this.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <br> tests/hard_coded/Mmakefile:<br> tests/hard_coded/string_char_list_ilseq.exp:<br> tests/hard_coded/string_char_list_ilseq.exp2:<br> tests/hard_coded/string_char_list_ilseq.m:<br> Add test case.<br> ---<br> library/string.m | 123 ++++---------------<br> tests/hard_coded/Mmakefile | 1 +<br> tests/hard_coded/string_char_list_ilseq.exp | 5 +<br> tests/hard_coded/string_char_list_ilseq.exp2 | 5 +<br> tests/hard_coded/string_char_list_ilseq.m | 50 ++++++++<br> 5 files changed, 88 insertions(+), 96 deletions(-)<br> create mode 100644 tests/hard_coded/string_char_list_ilseq.exp<br> create mode 100644 tests/hard_coded/string_char_list_ilseq.exp2<br> create mode 100644 tests/hard_coded/string_char_list_ilseq.m<br> <br> diff --git a/library/string.m b/library/string.m<br> index 1b0d32e87..0501a3198 100644<br> --- a/library/string.m<br> +++ b/library/string.m<br> @@ -124,9 +124,15 @@<br> %<br> <br> % Convert the string to a list of characters (code points).<br> + %<br> + % In the forward mode:<br> + % If strings use UTF-8 encoding then each code unit in an ill-formed<br> + % sequence is replaced by U+FFFD REPLACEMENT CHARACTER in the list.<br> + % If 
strings use UTF-16 encoding then each unpaired surrogate code point<br> + % is returned as a separate code point in the list.<br> + %<br> % The reverse mode of the predicate throws an exception if<br> % the list of characters contains a null character.<br> - %<br> % NOTE: In the future we may also throw an exception if the list contains<br> % a surrogate code point.<br> %<br> @@ -135,10 +141,17 @@<br> :- mode to_char_list(in, out) is det.<br> :- mode to_char_list(uo, in) is det.<br></blockquote><div><br></div><div>Hmm, the reverse mode is not det: U+FFFD relates to every possible ill-formed sequence as well as to the correctly formed replacement char, so a char list containing U+FFFD does not determine a unique string. The existing implementation is similarly incorrect. I would suggest leaving these as they are and defining new functions to convert to/from char lists (in addition to the ones you proposed to convert to char_or_code_unit).</div><div><br></div><div>The actual code is fine, as is the test case.</div><div><br></div><div>Mark</div><div> </div></div></div>