[m-rev.] for review: Define behaviour of string.to_char_list (and rev) on ill-formed sequences.
Peter Wang
novalazy at gmail.com
Mon Oct 21 14:06:40 AEDT 2019
On Thu, 17 Oct 2019 04:24:38 +1100, Mark Brown <mark at mercurylang.org> wrote:
> On Thu, Oct 17, 2019 at 3:52 AM Mark Brown <mark at mercurylang.org> wrote:
> >
> > Hmm, the reverse mode is not det, as U+FFFD relates to every possible ill-formed sequence (as well as to the correctly formed replacement char). The existing implementation is similarly incorrect. I would suggest leaving these as is and defining new functions to convert to/from char lists (in addition to ones you proposed to convert to char_or_code_unit).
>
> In fact, since the signature of this predicate implies the ability to
> round-trip strings, maybe it should implement your other proposal to
> inject utf8 code units into the surrogate range after all. While I'm
> not convinced it's a good idea to do that generally, it's probably the
> least bad option for existing code that uses to_char_list/2.
Hi Mark,
I think providing predicates/functions to convert strings to lists and
back without losing the ill-formed sequences would be convenient, but
I'm not comfortable with assigning special interpretations to chars
in some predicates.
There are plenty of bits left over in a word, so the type

    :- type char_or_byte
        --->    char(char)
        ;       byte(uint8).

wouldn't incur any boxing (at least for C backends), so it should be
enough to provide conversions between string and list(char_or_byte).
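
As a rough sketch of why such a type helps (not a concrete proposal: the
module name and the function lossy_chars below are made up for
illustration), mapping a list(char_or_byte) down to a plain list(char)
has to send every byte(_) to U+FFFD, so distinct inputs collapse to the
same output and the mapping cannot be inverted deterministically:

```mercury
:- module char_or_byte_sketch.
:- interface.
:- import_module char.
:- import_module list.

:- type char_or_byte
    --->    char(char)
    ;       byte(uint8).

    % Hypothetical helper, not part of the standard library.
    % Keeps well-formed chars as they are and replaces each stray
    % code unit with U+FFFD, discarding the code unit's value.
    %
:- func lossy_chars(list(char_or_byte)) = list(char).

:- implementation.

lossy_chars([]) = [].
lossy_chars([X | Xs]) = [C | lossy_chars(Xs)] :-
    (
        X = char(C)
    ;
        X = byte(_),
        % 0xfffd is the code point of the replacement character.
        C = char.det_from_int(0xfffd)
    ).
```

Both [byte(B)], for any B, and [char(C)] with C = U+FFFD map to the same
one-element list, which is exactly the information that a conversion to
list(char_or_byte) would keep and a conversion to list(char) loses.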
The implied ability of to_char_list to round trip is a bit of a problem.
I suggest we deprecate and eventually remove the reverse modes of
to_char_list, to_rev_char_list, etc. Since `pragma obsolete' cannot warn
about uses of a particular mode of a predicate only, I suggest we mark
the predicates obsolete with a note explaining that only the reverse
modes are deprecated, and ask users to call the singly-moded
to_char_list/from_char_list functions instead.
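
For illustration only, this is roughly what such a declaration looks like
on a stand-in predicate. chars_of and the module name are hypothetical,
and the sketch assumes the installed library still provides both modes of
string.to_char_list/2. The pragma takes a name/arity pair, so it covers
every mode of the predicate at once:

```mercury
:- module obsolete_sketch.
:- interface.
:- import_module char.
:- import_module list.

    % Hypothetical wrapper standing in for string.to_char_list/2.
    % Only the reverse (uo, in) mode is meant to be deprecated; the
    % comment has to say so because the pragma below cannot.
    %
:- pred chars_of(string, list(char)).
:- mode chars_of(in, out) is det.
:- mode chars_of(uo, in) is det.

    % Named by name/arity only, so calls in either mode get the warning.
:- pragma obsolete(chars_of/2).

:- implementation.
:- import_module string.

chars_of(Str, Chars) :-
    string.to_char_list(Str, Chars).
```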
Peter