[m-dev.] io.read_named_file_as_*

Peter Wang novalazy at gmail.com
Tue Apr 11 16:00:52 AEST 2023


On Sun, 09 Apr 2023 13:39:51 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> 
> 2023-04-09 12:02 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> > On Sat, 08 Apr 2023 11:28:14 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> >> >> The relevant documentation warns about the parser
> >> >> that operates on the whole file as a string not being expected
> >> >> to do the right thing given a malformed UTF-{8,16} string.
> >> >
> >> > I don't think I can find the warning you are referring to.
> >> 
> >> The warnings on read_named_file_as_string and read_named_file_as_lines.
> > 
> > Oh, I thought you meant the Mercury term parser.
> > 
> > read_named_file_as_string and read_named_file_as_lines
> > do the right thing, for a definition of "right thing".
> 
> There are several "things" that people can consider the "right thing"
> wrt these predicates. The comments on their declarations do not say
> which, if any, of these things they do. Can you please fix that?
> For the code path I am interested in, read_file_as_string when targeting C,
> read_file_as_string just allocs a buffer the size of the file, fills it from
> the file, and checks for premature NULs, but I see no checks for
> non-well-formed chars.

Yes, this allows the C grades to work with text files that are not
strictly UTF-8 encoded (to a certain extent). This might be due to a
mistake of encoding in the file, or the file could be in another
ASCII-compatible encoding such as ISO-8859-*.

It doesn't work the same for Java/C# grades because UTF-16 strings can't
hold arbitrary bytes. Reading from a text file in those grades goes
through a translation from an expected encoding to UTF-16, so any
non-well-formed sequences will be replaced in the string.
We don't expose any way to specify an expected input encoding in the
standard library.

> 
> I note that the read_line predicate *does* check for non-well-formed chars
> (in the C foreign_proc for read_char_code_2), but that this code is not linked
> to other code that does the same check, such as unsafe_index_next_repl
> in string.m.

Yes, a Mercury char can't represent a code unit in an non-well-formed
UTF-8 sequence so the predicates returning chars must return U+FFFD
or report an error, and the behaviour should be documented.

> 
> Attached is an incomplete diff proposing improved wording for
> the documentation of index_next and index_next_repl.  It seems that
> index_next_repl *always* returns not_replaced when targeting UTF-16.
> Its old documented seems to *allude* to this fact, but does not state it
> outright. I would like to know whether this was deliberate or oversight.
> I can see a reason why not returning a meaningful MaybeReplaced value
> with UTF-16 would be acceptable: callers can test for themselves
> whether the returned Char is in the surrogate range or not.

Yes, in Java and C# grades, an unpaired surrogate code point value can
be represented in a Mercury char, so it's not replaced.

> On the other
> hand, writing code that works on *both* UTF-8 and UTF-16 encodings
> would be simpler if index_next_repl returned a meaningful MaybeReplaced
> value even on UTF-16,

Perhaps.

> and looking at the Java and C# foreign_procs,
> arranging that would be in C#. I could find any equivalent for Java;
> do you know of any?

I think you mean that the C# foreign_proc unsafe_index_next_repl_2
has a specific catch block to handle unpaired surrogate code points,
but the Java foreign_proc does not. That's because the Java
String.codePointAt method will return an unpaired surrogate code point
value as-is:
https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointAt(int)

> Specifially, would just checking that Ch is inside/outside
> the surrogate range work, and be acceptably fast? If yes, then we should always
> return a meaningful MaybeReplaced value. If not, then always returning
> not_replaced for UTF-16 is fine.

It would work. I don't about the speed but I assume it would be fine.

> We can either
> 
> - change mercury_term_parser.m to *always* report an error when
>   it finds a non-well-formed character, or
> 
> - change it to report an error *only* if some flag says it should.
> 
> The first breaks backwards compatibility (though only in a hopefully-rare
> situation), while the second would require either an additional argument
> on read_term and its variants (breaking compatibility for everyone though
> in an easily-visible and easy-to-fix way), or additional variants of existing
> predicates.
> 

I think the first option is acceptable.

The diff looks good, thanks.

Peter


More information about the developers mailing list