[m-rev.] for review: Define behaviour of string.to_char_list (and rev) on ill-formed sequences.

Mark Brown mark at mercurylang.org
Thu Oct 17 03:52:51 AEDT 2019


Hi Peter,

On Wed, Oct 16, 2019 at 5:04 PM Peter Wang <novalazy at gmail.com> wrote:

> library/string.m:
>     Define string.to_char_list and string.to_rev_char_list to either
>     replace code units in ill-formed sequences with U+FFFD or return
>     unpaired surrogate code points.
>
>     Use Mercury version of do_to_char_list instead of updating
>     the foreign language implementations.
>

I agree with this.


>
> tests/hard_coded/Mmakefile:
> tests/hard_coded/string_char_list_ilseq.exp:
> tests/hard_coded/string_char_list_ilseq.exp2:
> tests/hard_coded/string_char_list_ilseq.m:
>     Add test case.
> ---
>  library/string.m                             | 123 ++++---------------
>  tests/hard_coded/Mmakefile                   |   1 +
>  tests/hard_coded/string_char_list_ilseq.exp  |   5 +
>  tests/hard_coded/string_char_list_ilseq.exp2 |   5 +
>  tests/hard_coded/string_char_list_ilseq.m    |  50 ++++++++
>  5 files changed, 88 insertions(+), 96 deletions(-)
>  create mode 100644 tests/hard_coded/string_char_list_ilseq.exp
>  create mode 100644 tests/hard_coded/string_char_list_ilseq.exp2
>  create mode 100644 tests/hard_coded/string_char_list_ilseq.m
>
> diff --git a/library/string.m b/library/string.m
> index 1b0d32e87..0501a3198 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -124,9 +124,15 @@
>  %
>
>      % Convert the string to a list of characters (code points).
> +    %
> +    % In the forward mode:
> +    % If strings use UTF-8 encoding then each code unit in an ill-formed
> +    % sequence is replaced by U+FFFD REPLACEMENT CHARACTER in the list.
> +    % If strings use UTF-16 encoding then each unpaired surrogate code
> point
> +    % is returned as a separate code point in the list.
> +    %
>      % The reverse mode of the predicate throws an exception if
>      % the list of characters contains a null character.
> -    %
>      % NOTE: In the future we may also throw an exception if the list
> contains
>      % a surrogate code point.
>      %
> @@ -135,10 +141,17 @@
>  :- mode to_char_list(in, out) is det.
>  :- mode to_char_list(uo, in) is det.
>

Hmm, the reverse mode is not det, as U+FFFD relates to every possible
ill-formed sequence (as well as to the correctly formed replacement char).
The existing implementation is similarly incorrect. I would suggest leaving
these as is and defining new functions to convert to/from char lists (in
addition to ones you proposed to convert to char_or_code_unit).

The actual code is fine, as is the test case.

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20191017/3ef97d8b/attachment-0001.html>


More information about the reviews mailing list