[m-rev.] [m-dev.] io.read_named_file_as_*

Thu Apr 20 20:18:35 AEST 2023

2023-04-11 16:00 GMT+10:00 "Peter Wang" <novalazy at gmail.com>: 
>> I note that the read_line predicate *does* check for non-well-formed chars
>> (in the C foreign_proc for read_char_code_2), but that this code is not linked
>> to other code that does the same check, such as unsafe_index_next_repl
>> in string.m.
> 
> Yes, a Mercury char can't represent a code unit in a non-well-formed
> UTF-8 sequence so the predicates returning chars must return U+FFFD
> or report an error, and the behaviour should be documented.

The attached diff does an end-run around these concerns by
adding predicates that check, after reading is finished, whether the string
that was read is well formed. For your review.
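
To make "well formed" concrete, here is a minimal sketch in C of a check
of this kind, run once over the complete buffer after it has been read.
It is not the code in the attached diff: the function name
utf8_is_well_formed is hypothetical, and the byte ranges are the ones in
the Unicode Standard's table of well-formed UTF-8 byte sequences.

```c
#include <stdbool.h>
#include <stddef.h>

/*
** Hypothetical helper, for illustration only.
** Return true iff the len bytes at s form well-formed UTF-8:
** no stray continuation bytes, no overlong encodings,
** no surrogates, and no code points above U+10FFFF.
*/
bool utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t n;                   /* continuation bytes expected */
        unsigned char lo = 0x80;    /* allowed range for the first */
        unsigned char hi = 0xBF;    /* continuation byte */

        if (b <= 0x7F) {
            i++;                    /* ASCII */
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            n = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            n = 2;
            if (b == 0xE0) lo = 0xA0;   /* reject overlong forms */
            if (b == 0xED) hi = 0x9F;   /* reject surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            n = 3;
            if (b == 0xF0) lo = 0x90;   /* reject overlong forms */
            if (b == 0xF4) hi = 0x8F;   /* reject > U+10FFFF */
        } else {
            return false;           /* 0x80..0xC1 or 0xF5..0xFF */
        }

        if (len - i <= n) {
            return false;           /* sequence is truncated */
        }
        if (s[i + 1] < lo || s[i + 1] > hi) {
            return false;
        }
        for (size_t k = 2; k <= n; k++) {
            if (s[i + k] < 0x80 || s[i + k] > 0xBF) {
                return false;
            }
        }
        i += n + 1;
    }
    return true;
}
```

Checking once at the end keeps the read loop itself unchanged; the cost
is one extra pass over the buffer, and the caller learns only that the
data is not well formed, not what a char-by-char reader would have
returned for the offending bytes.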

>> We can either
>> 
>> - change mercury_term_parser.m to *always* report an error when
>>   it finds a non-well-formed character, or
>> 
>> - change it to report an error *only* if some flag says it should.
>> 
>> The first breaks backwards compatibility (though only in a hopefully-rare
>> situation), while the second would require either an additional argument
>> on read_term and its variants (breaking compatibility for everyone though
>> in an easily-visible and easy-to-fix way), or additional variants of existing
>> predicates.
>> 
> 
> I think the first option is acceptable.

Thanks. But before implementing it, I looked into what the lexer currently does
for non-well-formed input.  (The parser never looks at raw characters; the lexer does.) 
I found a mess. The attached lexer_utf file documents a 3x2 matrix: target languages
C, C# and Java, and reading data from a file or from a string. I can tell what is happening
in five of those combos, but the sixth left me stumped, because (what I take to be)
the official Microsoft documentation is silent on what the relevant C# method does
when given non-well-formed input. Can anyone fill in this slot in the matrix?

Zoltan.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Log.wf
Type: application/octet-stream
Size: 1148 bytes
Desc: not available
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20230420/92aa0b4b/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DIFF.wf
Type: application/octet-stream
Size: 32721 bytes
Desc: not available
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20230420/92aa0b4b/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lexer_utf
Type: application/octet-stream
Size: 2505 bytes
Desc: not available
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20230420/92aa0b4b/attachment-0005.obj>