[m-rev.] for review: Define behaviour of string.index on ill-formed sequences.

Julien Fischer jfischer at opturion.com
Wed Oct 16 17:57:46 AEDT 2019


Hi Peter,

On Wed, 16 Oct 2019, Peter Wang wrote:

> library/string.m:
>    Make string.index/3 and string.unsafe_index/3
>    return either U+FFFD or an unpaired surrogate code point
>    when an ill-formed code unit sequence is detected.
>
> tests/hard_coded/Mmakefile:
> tests/hard_coded/string_index_ilseq.exp:
> tests/hard_coded/string_index_ilseq.exp2:
> tests/hard_coded/string_index_ilseq.m:
>    Add test case.
> ---
> library/string.m                         | 70 ++++++++++--------------
> tests/hard_coded/Mmakefile               |  1 +
> tests/hard_coded/string_index_ilseq.exp  | 10 ++++
> tests/hard_coded/string_index_ilseq.exp2 |  6 ++
> tests/hard_coded/string_index_ilseq.m    | 42 ++++++++++++++
> 5 files changed, 87 insertions(+), 42 deletions(-)
> create mode 100644 tests/hard_coded/string_index_ilseq.exp
> create mode 100644 tests/hard_coded/string_index_ilseq.exp2
> create mode 100644 tests/hard_coded/string_index_ilseq.m
>
> diff --git a/library/string.m b/library/string.m
> index bffebc68f..09716dc49 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -228,29 +228,32 @@
>
>     % index(String, Index, Char):
>     %
> -    % `Char' is the character (code point) in `String', beginning at the
> -    % code unit `Index'. Fails if `Index' is out of range (negative, or
> -    % greater than or equal to the length of `String').
> +    % If `Index' is the initial code unit offset of a well-formed code unit
> +    % sequence in `String' then `Char' is the code point encoded by that
> +    % sequence.
> +    %
> +    % If the code unit in `String' at `Index' is part of an ill-formed sequence
> +    % then `Char' is either U+FFFD REPLACEMENT CHARACTER (if strings use UTF-8

    is either a

For the parenthetical, I suggest:

     when strings are UTF-8 encoded

> +    % encoding) or the unpaired surrogate code point at `Index' (if strings use
> +    % UTF-16 encoding).

Likewise:

     when strings are UTF-16 encoded

We should probably also add:

     char.is_replacement_character/1

to the stdlib (e.g. for use with filters).
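
A minimal sketch of what such a predicate might look like, and how it
could be used to filter a list of chars, follows. This is only an
illustration: is_replacement_character/1 is defined locally here because
it is a proposal rather than an existing part of the char module, and the
module name replacement_char_sketch is made up for the example.

```mercury
:- module replacement_char_sketch.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module char.
:- import_module list.
:- import_module string.

    % Hypothetical predicate (proposed above, defined locally for this sketch).
    % Succeeds iff Char is U+FFFD REPLACEMENT CHARACTER.
    %
:- pred is_replacement_character(char::in) is semidet.

is_replacement_character(Char) :-
    char.to_int(Char) = 0xfffd.

main(!IO) :-
    % Build a small list that contains one replacement character.
    Chars = ['a', char.det_from_int(0xfffd), 'b'],
    % Keep only the chars for which the test fails, i.e. drop U+FFFD.
    list.negated_filter(is_replacement_character, Chars, Kept),
    io.write_string(string.from_char_list(Kept), !IO),
    io.nl(!IO).
```

Note that a test like this cannot tell a U+FFFD that was really present
in the string from one returned by string.index for an ill-formed UTF-8
sequence, so it is only useful where that distinction does not matter.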

Julien.

