[m-rev.] for review: Define behaviour of string.to_char_list (and rev) on ill-formed sequences.
Mark Brown
mark at mercurylang.org
Thu Oct 17 03:52:51 AEDT 2019
Hi Peter,
On Wed, Oct 16, 2019 at 5:04 PM Peter Wang <novalazy at gmail.com> wrote:
> library/string.m:
> Define string.to_char_list and string.to_rev_char_list to either
> replace code units in ill-formed sequences with U+FFFD or return
> unpaired surrogate code points.
>
> Use Mercury version of do_to_char_list instead of updating
> the foreign language implementations.
>
I agree with this.
>
> tests/hard_coded/Mmakefile:
> tests/hard_coded/string_char_list_ilseq.exp:
> tests/hard_coded/string_char_list_ilseq.exp2:
> tests/hard_coded/string_char_list_ilseq.m:
> Add test case.
> ---
> library/string.m | 123 ++++---------------
> tests/hard_coded/Mmakefile | 1 +
> tests/hard_coded/string_char_list_ilseq.exp | 5 +
> tests/hard_coded/string_char_list_ilseq.exp2 | 5 +
> tests/hard_coded/string_char_list_ilseq.m | 50 ++++++++
> 5 files changed, 88 insertions(+), 96 deletions(-)
> create mode 100644 tests/hard_coded/string_char_list_ilseq.exp
> create mode 100644 tests/hard_coded/string_char_list_ilseq.exp2
> create mode 100644 tests/hard_coded/string_char_list_ilseq.m
>
> diff --git a/library/string.m b/library/string.m
> index 1b0d32e87..0501a3198 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -124,9 +124,15 @@
> %
>
> % Convert the string to a list of characters (code points).
> + %
> + % In the forward mode:
> + % If strings use UTF-8 encoding then each code unit in an ill-formed
> + % sequence is replaced by U+FFFD REPLACEMENT CHARACTER in the list.
> + % If strings use UTF-16 encoding then each unpaired surrogate code
> point
> + % is returned as a separate code point in the list.
> + %
> % The reverse mode of the predicate throws an exception if
> % the list of characters contains a null character.
> - %
> % NOTE: In the future we may also throw an exception if the list
> contains
> % a surrogate code point.
> %
> @@ -135,10 +141,17 @@
> :- mode to_char_list(in, out) is det.
> :- mode to_char_list(uo, in) is det.
>
Hmm, the reverse mode is not det, as U+FFFD relates to every possible
ill-formed sequence (as well as to the correctly formed replacement char).
The existing implementation is similarly incorrect. I would suggest leaving
these as is and defining new functions to convert to/from char lists (in
addition to ones you proposed to convert to char_or_code_unit).
The actual code is fine, as is the test case.
Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20191017/3ef97d8b/attachment-0001.html>
More information about the reviews
mailing list