[m-rev.] for review: Define behaviour of string.index on ill-formed sequences.
Julien Fischer
jfischer at opturion.com
Wed Oct 16 17:57:46 AEDT 2019
Hi Peter,
On Wed, 16 Oct 2019, Peter Wang wrote:
> library/string.m:
> Make string.index/3 and string.unsafe_index/3
> return either U+FFFD or an unpaired surrogate code point
> when an ill-formed code unit sequence is detected.
>
> tests/hard_coded/Mmakefile:
> tests/hard_coded/string_index_ilseq.exp:
> tests/hard_coded/string_index_ilseq.exp2:
> tests/hard_coded/string_index_ilseq.m:
> Add test case.
> ---
> library/string.m | 70 ++++++++++--------------
> tests/hard_coded/Mmakefile | 1 +
> tests/hard_coded/string_index_ilseq.exp | 10 ++++
> tests/hard_coded/string_index_ilseq.exp2 | 6 ++
> tests/hard_coded/string_index_ilseq.m | 42 ++++++++++++++
> 5 files changed, 87 insertions(+), 42 deletions(-)
> create mode 100644 tests/hard_coded/string_index_ilseq.exp
> create mode 100644 tests/hard_coded/string_index_ilseq.exp2
> create mode 100644 tests/hard_coded/string_index_ilseq.m
>
> diff --git a/library/string.m b/library/string.m
> index bffebc68f..09716dc49 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -228,29 +228,32 @@
>
> % index(String, Index, Char):
> %
> - % `Char' is the character (code point) in `String', beginning at the
> - % code unit `Index'. Fails if `Index' is out of range (negative, or
> - % greater than or equal to the length of `String').
> + % If `Index' is the initial code unit offset of a well-formed code unit
> + % sequence in `String' then `Char' is the code point encoded by that
> + % sequence.
> + %
> + % If the code unit in `String' at `Index' is part of an ill-formed sequence
> + % then `Char' is either U+FFFD REPLACEMENT CHARACTER (if strings use UTF-8
s/is either/is either a/
For the parenthetical, I suggest:
    when strings are UTF-8 encoded
> + % encoding) or the unpaired surrogate code point at `Index' (if strings use
> + % UTF-16 encoding).
Likewise:
    when strings are UTF-16 encoded
We should probably also add:
    char.is_replacement_character/1
to the stdlib (e.g. for use with filters).
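A minimal sketch of what such a predicate could look like, assuming it is
nothing more than a test for the code point U+FFFD. The module name
replacement_char_sketch is hypothetical and the body is an illustration,
not the actual library implementation:

```mercury
:- module replacement_char_sketch.
:- interface.

:- import_module char.

    % Succeeds iff the given character is U+FFFD REPLACEMENT CHARACTER.
    % (Hypothetical definition; the name follows the suggestion above.)
    %
:- pred is_replacement_character(char::in) is semidet.

:- implementation.

is_replacement_character(Char) :-
    % Compare the code point of Char against U+FFFD.
    char.to_int(Char) = 0xfffd.
```

Being a semidet predicate with a single input argument, it could be passed
directly to list.filter/3 to pick out the replacement characters from a
list of chars.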
Julien.