[m-rev.] [m-dev.] io.read_named_file_as_*

Sun Apr 9 13:39:51 AEST 2023

2023-04-09 12:02 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> On Sat, 08 Apr 2023 11:28:14 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
>> >> The relevant documentation warns about the parser
>> >> that operates on the whole file as a string not being expected
>> >> to do the right thing given a malformed UTF-{8,16} string.
>> >
>> > I don't think I can find the warning you are referring to.
>> 
>> The warnings on read_named_file_as_string and read_named_file_as_lines.
> 
> Oh, I thought you meant the Mercury term parser.
> 
> read_named_file_as_string and read_named_file_as_lines
> do the right thing, for a definition of "right thing".

There are several "things" that people can consider the "right thing"
wrt these predicates. The comments on their declarations do not say
which, if any, of these things they do. Can you please fix that?
For the code path I am interested in, read_file_as_string when targeting C,
read_file_as_string just allocs a buffer the size of the file, fills it from
the file, and checks for premature NULs, but I see no checks for
non-well-formed chars.

I note that the read_line predicate *does* check for non-well-formed chars
(in the C foreign_proc for read_char_code_2), but that this code is not linked
to other code that does the same check, such as unsafe_index_next_repl
in string.m.

Attached is an incomplete diff proposing improved wording for
the documentation of index_next and index_next_repl.  It seems that
index_next_repl *always* returns not_replaced when targeting UTF-16.
Its old documented seems to *allude* to this fact, but does not state it
outright. I would like to know whether this was deliberate or oversight.
I can see a reason why not returning a meaningful MaybeReplaced value
with UTF-16 would be acceptable: callers can test for themselves
whether the returned Char is in the surrogate range or not. On the other
hand, writing code that works on *both* UTF-8 and UTF-16 encodings
would be simpler if index_next_repl returned a meaningful MaybeReplaced
value even on UTF-16, and looking at the Java and C# foreign_procs,
arranging that would be in C#. I could find any equivalent for Java;
do you know of any? Specifially, would just checking that Ch is inside/outside
the surrogate range work, and be acceptably fast? If yes, then we should always
return a meaningful MaybeReplaced value. If not, then always returning
not_replaced for UTF-16 is fine. In either case, the chosen behavior should be
documented, together with the rationale for any differences in behavior
between targets.

>> > The parser should be able to report a malformed code unit sequence as a
>> > syntax error. Currently, each code unit in a malformed sequence will be
>> > silently treated as the Unicode replacement character; we can change
>> > that.
>> 
>> I presume you mean "change it to report an error".
>> 
> 
> Yes.

We can either

- change mercury_term_parser.m to *always* report an error when
  it finds a non-well-formed character, or

- change it to report an error *only* if some flag says it should.

The first breaks backwards compatibility (though only in a hopefully-rare
situation), while the second would require either an additional argument
on read_term and its variants (breaking compatibility for everyone though
in an easily-visible and easy-to-fix way), or additional variants of existing
predicates.

> I'm mainly pointing out that checking for malformed sequences can be
> done during parsing instead of a preceding step. I do think it's worth
> doing, but later.

The above analysis shows that doing it during parsing is not trivial.
And even if it were, some programs that read in a file as a string
won't give the resulting string to the parser, so detecting non-well-formed
chars via the parser won't work for them.

> The changes look fine.

Thank you.

Zoltan.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DIFF.si
Type: application/octet-stream
Size: 5676 bytes
Desc: not available
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20230409/f2116c16/attachment.obj>