[m-dev.] io.read_named_file_as_*

Fri Apr 21 16:59:01 AEST 2023

On Thu, 20 Apr 2023 20:18:35 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
>  
> Thanks. But before implementing it, I looked into what the lexer currently does
> for non-well-formed input.  (The parser never looks at raw characters; the lexer does.) 
> I found a mess. The attached lexer_utf file documents a 3x2 matrix: target languages
> C, C# and Java, and reading data from a file or from a string. I can tell what is happening
> in five of those combos, but the sixth left me stumped, because (what I take to be)
> the official Microsoft documentation is silent on what the relevant C# method does
> when given non-well-formed input. Can anyone fill in this slot in the matrix?
> 
> Zoltan.

> -------------------------------------------------------------------------
> The io version of reading a token list.
> 
>     The top levels of the call tree, shared among all target languages:
>     The target-language-specific sections below extend this.
> 
>     mercury_term_lexer.get_token_list
>         mercury_term_lexer.get_token
>             mercury_term_lexer.get_token_2
>                 io.read_char_unboxed
>                     io.primitives_read.read_char_code
>                         io.primitives_read.read_char_code_2
> 
> --------------------
>     c
>                             mercury_string.c: MR_utf8_get
> 
>         returns ERROR for invalid UTF-8
> 
> -----------------
>     c#
>                             io.primitives_read: C# mercury_getc
>                                 mf.reader = new System.IO.StreamReader(...)
>                                 mf.reader.Read
> 
>         WHAT DOES StreamReader return for invalid UTF-8, or invalid UTF-16?
>         the following official-looking doc is silent on the issue
>         https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader.read?view=net-8.0#system-io-streamreader-read

It depends on the encoding that the StreamReader was instantiated with.

https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding#Replacement

The default behaviour is to return the replacement character (U+FFFD).
It's possible to create an Encoding instance that throws an exception
instead, by passing true for the throwOnInvalidBytes argument of the
UTF8Encoding constructor.

https://learn.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.-ctor?view=net-7.0#system-text-utf8encoding-ctor(system-boolean-system-boolean)

> -----------------
>     java
>                             io.stream_ops: Java MR_TextInputFile.read_char
> 
>         returns the 0xFFFD replacement character for invalid UTF-16
>         that UTF-16 could have been created by conversion from UTF-8
> 

That is the default behaviour. As with C#, it's possible to make the
stream reader throw an exception on an ill-formed sequence.
(See the following program, thanks to ChatGPT.)

Peter

import java.io.*;
import java.nio.charset.*;

class Main {
  public static void main(String[] args) {
    // Build a UTF-8 decoder that reports ill-formed input instead of
    // substituting U+FFFD (the default error action is REPLACE).
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

    // try-with-resources closes the reader, and the FileInputStream
    // underneath it, even if decoding fails part-way through.
    try (InputStreamReader isr =
        new InputStreamReader(new FileInputStream("test.txt"), decoder)) {
      int c;

      // read() returns one UTF-16 code unit, or -1 at end of file.
      while ((c = isr.read()) != -1) {
        // Do something with the character
        System.out.print((char) c);
      }
    } catch (MalformedInputException e) {
      // Thrown by read() when the input is not well-formed UTF-8.
      System.out.println("Ill-formed input: " + e.getMessage());
    } catch (IOException e) {
      System.out.println("An error occurred: " + e.getMessage());
    }
  }
}
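
For comparing the two behaviours side by side without a test file, here
is a rough sketch that decodes the same in-memory bytes twice: once with
the default InputStreamReader setup, and once with a REPORT decoder. The
class name DecodeDemo, its helper methods and the sample bytes are made
up for illustration; none of this is from the Mercury library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Mercury runtime.
class DecodeDemo {
  // Drain a reader into a String, one UTF-16 code unit at a time.
  static String readAll(InputStreamReader r) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = r.read()) != -1) {
      sb.append((char) c);
    }
    return sb.toString();
  }

  // Default behaviour: ill-formed sequences become U+FFFD.
  static String decodeLenient(byte[] bytes) throws IOException {
    try (InputStreamReader r = new InputStreamReader(
        new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
      return readAll(r);
    }
  }

  // REPORT behaviour: ill-formed sequences raise an exception.
  static String decodeStrict(byte[] bytes) throws IOException {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try (InputStreamReader r = new InputStreamReader(
        new ByteArrayInputStream(bytes), decoder)) {
      return readAll(r);
    }
  }

  public static void main(String[] args) throws IOException {
    // 0xFF can never appear in well-formed UTF-8.
    byte[] bad = {'a', (byte) 0xFF, 'b'};

    String lenient = decodeLenient(bad);
    System.out.println("lenient saw U+FFFD: " + (lenient.indexOf('\uFFFD') >= 0));

    try {
      decodeStrict(bad);
      System.out.println("strict: no error");
    } catch (CharacterCodingException e) {
      System.out.println("strict: " + e);
    }
  }
}
```

The strict variant catches CharacterCodingException, the superclass of
MalformedInputException, so the same handler also covers unmappable
characters.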