[m-dev.] io.read_named_file_as_*
Peter Wang
novalazy at gmail.com
Fri Apr 21 15:46:01 AEST 2023
On Thu, 20 Apr 2023 20:18:35 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
>
> 2023-04-11 16:00 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> >> I note that the read_line predicate *does* check for non-well-formed chars
> >> (in the C foreign_proc for read_char_code_2), but that this code is not linked
> >> to other code that does the same check, such as unsafe_index_next_repl
> >> in string.m.
> >
> > Yes, a Mercury char can't represent a code unit in an non-well-formed
> > UTF-8 sequence so the predicates returning chars must return U+FFFD
> > or report an error, and the behaviour should be documented.
>
> The attached diff does an end-run around these concerns by
> adding predicates that do a check at the end whether the string being read
> is well formed. For your review.
>
> >> We can either
> >>
> >> - change mercury_term_parser.m to *always* report an error when
> >> it finds a non-well-formed character, or
> >>
> >> - change it to report an error *only* if some flag says it should.
> >>
> >> The first breaks backwards compatibility (though only in a hopefully-rare
> >> situation), while the second would require either an additional argument
> >> on read_term and its variants (breaking compatibility for everyone though
> >> in an easily-visible and easy-to-fix way), or additional variants of existing
> >> predicates.
> >>
> >
> > I think the first option is acceptable.
>
> Thanks. But before implementing it, I looked into what the lexer currently does
> for non-well-formed input. (The parser never looks at raw characters; the lexer does.)
> I found a mess. The attached lexer_utf file documents a 3x2 matrix: target languages
> C, C# and Java, and reading data from a file or from a string. I can tell what is happening
> in five of those combos, but the sixth left me stumped, because (what I take to be)
> the official Microsoft documentation is silent on what the relevant C# method does
> when given non-well-formed input. Can anyone fill in this slot in the matrix?
> diff --git a/library/string.m b/library/string.m
> index dd5721dcc..a180978e1 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -284,33 +284,68 @@
...
> - % Fails if Index is out of range (negative, or greater than or equal to
> - % the length of String).
> + % - On platforms the encode strings using UTF-8 (i.e. when targeting C)
> + % Char will be set to U+FFFD (the Unicode replacement character).
> + %
> + % - On platforms the encode strings using UTF-16 (i.e. when targeting
> + % C# or Java), Char will be set to the unpaired surrogate code point
> + % at Index. (For more details, see the comment just below.)
> %
> :- pred index_next(string::in, int::in, int::out, char::uo) is semidet.
s/the/that
> % index_next_repl(String, Index, NextIndex, Char, MaybeReplaced):
> %
> - % Like index_next/4 but also returns MaybeReplaced on success.
> - % When Char is not U+FFFD, then MaybeReplaced is always `not_replaced'.
> - % When Char is U+FFFD (the Unicode replacement character), then there are
> - % two cases:
> + % Does the same job as index_next/4 but on success, it also returns
> + % MaybeReplaced, which will specify whether Char is the result
> + % of the replacement of a non-well-formed UTF-8 character with U+FFFD.
> %
> - % - If there is a U+FFFD code point encoded in String at
> - % [Index, NextIndex) then MaybeReplaced is `not_replaced'.
> + % On platforms that encode strings using UTF-8 (i.e. when targeting C),
> + % there are three cases.
> %
> - % - Otherwise, MaybeReplaced is `replaced_code_unit(CodeUnit)' where
> - % CodeUnit is the code unit in String at Index.
> + % - If Char is not U+FFFD, then MaybeReplaced will be `not_replaced'.
> + %
> + % - If char is U+FFFD because there is a well-formed code point encoded
> + % in String starting at Index, and that code point is U+FFFD, then
> + % MaybeReplaced will also be `not_replaced'.
> + %
> + % - If char is U+FFFD but there is *no* well formed code point encoded
> + % in String starting at Index, then MaybeReplaced will be
> + % `replaced_code_unit(CodeUnit)', where CodeUnit is the code unit
> + % at offset Index in String.
s/char/Char
> + %
> + % On platforms that encode strings using UTF-16 (i.e. when targeting C#
> + % or Java), MaybeReplaced will always be bound to `not_replaced'.
> + % The only ways that a UTF-16 string may be non-well-formed are
> + %
> + % - by having a high surrogate code unit (between 0xD800 and 0xDBFF)
> + % that is not immediately followed by a low surrogate code unit
> + % (between 0xDC00 and 0xDFFF), or
> + %
> + % - by having a low surrogate code unit that is not immediately preceded
> + % by a high surrogate code unit.
> + %
> + % In both cases, index_next_repl will return the unpaired surrogate
> + % unchanged as Char. There is no replacement required, because
> + %
> + % - surrogate code units are all the range 0xDC00 to 0xDFFF, and
0xD800 to 0xDFFF
> + % - the Unicode standard deliberately does not assign any characters
> + % to the code points in this range.
> + %
> + % This means that if Char is in this range, then it must be an unpaired
> + % surrogate, but since Char actually appears in String, it won't be
> + % a *replacement* of another character.
> %
> :- pred index_next_repl(string::in, int::in, int::out, char::uo,
> maybe_replaced::out) is semidet.
> @@ -3168,6 +3219,7 @@ is_empty("").
>
> %---------------------%
>
> +/*
> :- pragma foreign_proc("C",
> is_well_formed(S::in),
> [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail,
> @@ -3217,6 +3269,75 @@ is_empty("").
> }
> }
> ").
> +*/
(You can delete it now)
> +
> +is_well_formed(String) :-
> + find_first_ill_formed_pos(String, FirstIllFormedPos),
> + FirstIllFormedPos < 0.
> +
> +check_well_formedness(String, Result) :-
> + find_first_ill_formed_pos(String, FirstIllFormedPos),
> + ( if FirstIllFormedPos < 0 then
> + Result = well_formed
> + else
> + Result = ill_formed(FirstIllFormedPos)
> + ).
The diff looks good.
I'll have a look at the C# thing.
Peter
More information about the developers
mailing list