[m-dev.] io.read_named_file_as_*

Fri Apr 7 18:05:35 AEST 2023

On Fri, 07 Apr 2023 10:44:36 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> I am changing analysis.file.m to handle errors by returning error_specs
> instead of throwing exceptions, and in the process found it
> convenient to also change it from reading in one term at a time
> to reading in the whole file at once, and parsing each term
> without I/O. (There is no point in writing code to handle errors
> arising from the reading of a term if we can just eliminate
> that source of possible errors.)

As you like, but my conclusion about the analysis work is that it's the
wrong approach. If someone wants to continue with it, I suggest
thinking about using a database instead of reading/writing files
containing Mercury terms.

> 
> The relevant documentation warns about the parser
> that operates on the whole file as a string not being expected
> to do the right thing given a malformed UTF-{8,16} string.

I don't think I can find the warning you are referring to.

The parser should be able to report a malformed code unit sequence as a
syntax error. Currently, each code unit in a malformed sequence will be
silently treated as the Unicode replacement character; we can change
that. If a malformed sequence occurs within a comment, I think it's
preferable to ignore it instead of reporting an error.

If we do those things, there is no need to call is_well_formed.

Note that UTF-16 strings cannot represent malformed UTF-8 code unit
sequences, so by the time the input string gets to the parser, the
malformed sequences will already have been replaced by the Unicode
replacement character, making the string well formed. I assume it's
possible to make the C# and Java backends report malformed UTF-8 when
reading from a file, if that's what we want (I'm not sure that's
preferable in general).
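
To illustrate the difference between replacing malformed sequences and
reporting them, here is a minimal Python sketch. It is not Mercury
library code; the byte string and the helper name strict_decode are
made up for the example. It shows that once a malformed byte has been
decoded with replacement the string is well formed, so a later
well-formedness check cannot see the original problem, whereas strict
decoding reports it.

```python
# Illustration only: Python stands in for a backend whose native
# strings cannot hold malformed UTF-8.

raw = b"foo(\xff)."   # hypothetical file contents; 0xFF is never valid in UTF-8

# Lossy decoding: the malformed byte silently becomes U+FFFD.
lossy = raw.decode("utf-8", errors="replace")

# The result is well formed, so it round-trips through UTF-8 unchanged
# and a later well-formedness check has nothing left to detect.
round_trip = lossy.encode("utf-8").decode("utf-8")


def strict_decode(data):
    """Hypothetical helper: report malformed UTF-8 instead of replacing it."""
    try:
        return ("ok", data.decode("utf-8"))
    except UnicodeDecodeError as e:
        # e.start is the byte offset of the first malformed sequence.
        return ("error", e.start)
```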

> I could add a test for is_well_formed just after calling 
> read_file_as_string, but I expect that many callers of
> read_file_as_string will want to do the same, so the logical
> thing to do is factor out the commonality by adding
> versions of read_file_as_string (and read_file_as_lines etc)
> that include the well-formedness check.
> 
> I propose that we do so, in io.m, since we haven't started
> migrating user-visible predicates to io.text_read.m yet.
> Does anyone object? And what should the names
> of the new versions be? I propose
> 
> - read_file_as_well_formed_string
> - read_file_as_well_formed_lines
> - read_well_formed_line
> - read_well_formed_word

That seems fine unless you want to use a "_wf" suffix or something.
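
For concreteness, the shape of such a predicate can be sketched in
Python. This is only an analogue under stated assumptions: the function
borrows the proposed name, but the tagged-tuple result and the error
messages are invented for the example and are not the io.m interface.

```python
def read_file_as_well_formed_string(path):
    """Hypothetical analogue of the proposed predicate.

    Returns ("ok", text) if the file could be read and is well-formed
    UTF-8, and ("error", message) otherwise.
    """
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError as e:
        return ("error", f"cannot read {path}: {e.strerror}")
    try:
        # Strict decoding doubles as the well-formedness check.
        return ("ok", data.decode("utf-8"))
    except UnicodeDecodeError:
        return ("error", f"{path}: file is not well-formed UTF-8")
```

Folding the check into the read means callers handle one result value
rather than repeating the test after every read.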

> 
> For at least the first two, it would be nice to say *what part*
> of the file is not well formed. I could provide a companion
> predicate to is_well_formed that *would* return that info,
> if Peter says he will check my work.

Sure.
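
As a rough illustration of what a companion to is_well_formed could
return, here is a minimal Python sketch. The function name and the
(byte offset, line number) result are assumptions made for the example,
not a proposed Mercury interface.

```python
def find_malformed_utf8(data):
    """Hypothetical helper: locate the first malformed UTF-8 sequence.

    Returns None if data is well-formed UTF-8, otherwise a pair
    (byte_offset, line_number) for the first malformed sequence.
    Line numbers start at 1.
    """
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as e:
        # Count the newlines before the bad byte to get its line number.
        line_number = data.count(b"\n", 0, e.start) + 1
        return (e.start, line_number)
```

A byte offset plus a line number is probably enough for an error
message that points the user at the offending part of the file.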

Peter