[m-rev.] [m-dev.] io.read_named_file_as_*

Sat Apr 8 11:28:14 AEST 2023

2023-04-07 18:05 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> On Fri, 07 Apr 2023 10:44:36 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
>> I am changing analysis.file.m to handle errors by returning error_specs
>> instead of throwing exceptions, and in the process found it
>> convenient to also change it from reading in one term at a time
>> to reading in the whole file at once, and parsing each term
>> without I/O. (There is no point in writing code to handle errors
>> arising from the reading of a term if we can just eliminate
>> that source of possible errors.)
> 
> As you like, but my conclusion about the analysis work is that it's the
> wrong approach. If someone wanted to continue with it, I suggest
> thinking about using a database instead of reading/writing files
> containing Mercury terms.

I understand the pull of using a full DBMS, but it brings disadvantages
as well as advantages. Neither requiring users to install it themselves
nor including it as part of the Mercury distribution would be particularly
palatable, and the sorts of non-SQL DBMSs that would have support
for structured terms in the database also tend to give up the ACID
properties that we would want.

In any case, the attached diff implements the approach I outlined
in my earlier email. For review by anyone; the diff is with -b.

>> The relevant documentation warns about the parser
>> that operates on the whole file as a string not being expected
>> to do the right thing given a malformed UTF-{8,16} string.
> 
> I don't think I can find the warning you are referring to.

The warnings on read_named_file_as_string and read_named_file_as_lines.

> The parser should be able to report a malformed code unit sequence as a
> syntax error. Currently, each code unit in a malformed sequence will be
> silently treated as the Unicode replacement character; we can change
> that.

I presume you mean "change it to report an error".

> If a malformed sequence occurs within a comment, I think it's
> preferable to ignore it instead of reporting an error.

Analysis files cannot contain comments, so your proposal seems
isomorphic to mine.

> If we do those things, there is no need to call is_well_formed.
> 
> ** Note that UTF-16 strings cannot represent malformed UTF-8 code unit
> sequences, so by the time the input string gets to the parser, the
> malformed sequences will already have been replaced by the Unicode
> replacement character, making the string well formed. I assume it's
> possible to make the C# and Java backends report malformed UTF-8 when
> reading from a file, if that's what we wanted (I'm not sure that's
> preferable in general).

I don't know nearly enough about what real code that really cares
about i18n would want to do about malformed UTF-N for any N,
but for me, a policy of "any non-well-formed character in any file
processed by the Mercury compiler is an error that we report
together with its location" seems both simpler than, and just as good
as, any other policy. Specifically, I cannot see a situation in which
people would purposefully want to put non-well-formed chars
into e.g. a .m file. Can you?
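
To make "report together with its location" concrete, here is a minimal
sketch of the idea. It is in Python rather than Mercury, and it is only an
illustration, not the proposed library code. The function name
first_malformed_utf8 is hypothetical. It assumes the file contents are
available as raw bytes, and it counts columns in bytes rather than in
code points.

```python
from typing import Optional, Tuple


def first_malformed_utf8(data: bytes) -> Optional[Tuple[int, int, int]]:
    """Return (byte_offset, line, column) of the first malformed UTF-8
    sequence in data, or None if data is well formed.

    Line and column are 1-based; the column is a byte count within the line.
    """
    try:
        # Strict decoding raises on the first malformed sequence.
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        offset = err.start
        # Number of newlines before the bad byte gives the line number.
        line = data.count(b"\n", 0, offset) + 1
        # rfind returns -1 when there is no earlier newline, so +1 yields 0.
        line_start = data.rfind(b"\n", 0, offset) + 1
        return offset, line, offset - line_start + 1
    return None
```

A caller would treat a non-None result as an error and include the line
and column in the diagnostic, instead of silently substituting U+FFFD.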

>> I could add a test for is_well_formed just after calling 
>> read_file_as_string, but I expect that many callers of
>> read_file_as_string will want to do the same, so the logical
>> thing to do is factor out the commonality by adding
>> versions of read_file_as_string (and read_file_as_lines etc)
>> that include the well-formedness check.
>> 
>> I propose that we do so, in io.m, since we haven't started
>> migrating user-visible predicates to io.text_read.m yet.
>> Does anyone object? And what should the names
>> of the new versions be? I propose
>> 
>> - read_file_as_well_formed_string
>> - read_file_as_well_formed_lines
>> - read_well_formed_line
>> - read_well_formed_word
> 
> That seems fine unless you want to use a "_wf" suffix or something.

A _wf suffix it is.

>> For at least the first two, it would be nice to say *what part*
>> of the file is not well formed. I could provide a companion
>> predicate to is_well_formed that *would* return that info,
>> if Peter says he will check my work.
> 
> Sure.

Thank you.

Zoltan.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Log.an2
Type: application/octet-stream
Size: 1110 bytes
Desc: not available
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20230408/8658d841/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DIFF.an2
Type: application/octet-stream
Size: 43659 bytes
Desc: not available
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20230408/8658d841/attachment-0003.obj>