[m-rev.] [m-dev.] io.read_named_file_as_*
Peter Wang
novalazy at gmail.com
Fri Apr 21 16:59:01 AEST 2023
On Thu, 20 Apr 2023 20:18:35 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
>
> Thanks. But before implementing it, I looked into what the lexer currently does
> for non-well-formed input. (The parser never looks at raw characters; the lexer does.)
> I found a mess. The attached lexer_utf file documents a 3x2 matrix: target languages
> C, C# and Java, and reading data from a file or from a string. I can tell what is happening
> in five of those combos, but the sixth left me stumped, because (what I take to be)
> the official Microsoft documentation is silent on what the relevant C# method does
> when given non-well-formed input. Can anyone fill in this slot in the matrix?
>
> Zoltan.
> -------------------------------------------------------------------------
> The io version of reading a token list.
>
> The top levels of the call tree, shared among all target languages:
> The target-language-specific sections below extend this.
>
> mercury_term_lexer.get_token_list
> mercury_term_lexer.get_token
> mercury_term_lexer.get_token_2
> io.read_char_unboxed
> io.primitives_read.read_char_code
> io.primitives_read.read_char_code_2
>
> --------------------
> c
> mercury_string.c: MR_utf8_get
>
> returns ERROR for invalid UTF-8
>
> -----------------
> c#
> io.primitives_read: C# mercury_getc
> mf.reader = new System.IO.StreamReader(...)
> mf.reader.Read
>
> WHAT DOES StreamReader return for invalid UTF-8, or invalid UTF-16?
> the following official-looking doc is silent on the issue
> https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader.read?view=net-8.0#system-io-streamreader-read
It depends on the encoding that the StreamReader was instantiated with.
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding#Replacement
The default behaviour is to return the replacement character.
It's possible to create an Encoding instance that throws an exception
instead.
https://learn.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.-ctor?view=net-7.0#system-text-utf8encoding-ctor(system-boolean-system-boolean)
> -----------------
> java
> io.stream_ops: Java MR_TextInputFile.read_char
>
> returns the 0xFFFD replacement character for invalid UTF-16
> that UTF-16 could have been created by conversion from UTF-8
>
That is the default behaviour. As with C#, it's possible to make the
stream reader throw an exception on an ill-formed sequence.
(See the following program, thanks to ChatGPT.)
Peter
import java.io.*;
import java.nio.charset.*;

class Main {
    public static void main(String[] args) {
        try {
            // Open the file using an InputStreamReader with the specified
            // encoding and the REPORT error action
            FileInputStream fis = new FileInputStream("test.txt");
            Charset charset = Charset.forName("UTF-8");
            CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
            InputStreamReader isr = new InputStreamReader(fis, decoder);
            int c;
            // Read characters from the file until the end of the file is reached
            while ((c = isr.read()) != -1) {
                // Do something with the character
                System.out.print((char)c);
            }
            // Close the InputStreamReader and FileInputStream
            isr.close();
            fis.close();
        } catch (MalformedInputException e) {
            System.out.println("An error occurred: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("An error occurred: " + e.getMessage());
        }
    }
}
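
For comparing the two behaviours without needing a test.txt, here is a minimal sketch that decodes the same ill-formed bytes twice: once through an InputStreamReader constructed from a Charset (which, as far as I can tell, replaces malformed input with U+FFFD), and once through a decoder set to REPORT. The class name DecodeDemo, the helper names and the sample bytes are made up for illustration; they are not taken from the Mercury runtime.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Mercury runtime.
class DecodeDemo {
    // 'a', then a lone 0xFF byte (never valid in UTF-8), then 'b'.
    static final byte[] ILL_FORMED = { 'a', (byte) 0xFF, 'b' };

    // Read everything using a reader built from a Charset, i.e. with the
    // default error handling (replacement of malformed input).
    static String readWithDefault(byte[] bytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStreamReader isr = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = isr.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Read everything using a decoder set to REPORT. Returns true if the
    // reader threw MalformedInputException, false if it reached end of input.
    static boolean strictReadFails(byte[] bytes) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (InputStreamReader isr = new InputStreamReader(
                new ByteArrayInputStream(bytes), decoder)) {
            while (isr.read() != -1) {
                // Discard the characters; only the error behaviour matters here.
            }
            return false;
        } catch (MalformedInputException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        String replaced = readWithDefault(ILL_FORMED);
        System.out.println("default reader saw U+FFFD: "
            + (replaced.indexOf('\uFFFD') >= 0));
        System.out.println("strict reader threw: "
            + strictReadFails(ILL_FORMED));
    }
}
```

The strict variant also sets onUnmappableCharacter to REPORT; that is an addition to the program above, which only changes the malformed-input action.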