[m-users.] Read text files encoded as Latin1 or UTF16-LE

Peter Wang novalazy at gmail.com
Sat Oct 19 11:58:10 AEDT 2019


Hi,

On Fri, 18 Oct 2019 17:47:35 +0200, Dirk Ziegemeyer <dirk at ziegemeyer.de> wrote:
> Hi,
> 
> is there any best practice how to read text files encoded as Latin1 or UTF16-LE with Mercury?
> 
> When I use Julien’s csv stream reader (https://github.com/juliensf/mercury-csv) to parse a file encoded in Latin1, I first convert it to UTF-8 by calling iconv from within Mercury:
> 
> io.call_system("iconv -f ISO-8859-1 -t UTF-8 File.csv > File.csv.utf8-temp", CmdResult, !IO)
> 
> I would rather perform encoding conversion on the stream level to avoid disk write.
> 
> There is a hint in io.read_char_code documentation about „converting external character encodings into Mercury's internal character representation“, but I could not find out how to tell Mercury the encoding scheme of the stream.

The comment on io.read_char_code refers to the conversions in the
Java/C# backends, where the internal encoding is UTF-16 but the
external encoding is assumed to be UTF-8.

There is not yet any way to specify an encoding when opening streams.
We would need to add a new version of io.open_input that takes an
encoding argument, perhaps along the lines sketched below. That
should be fairly easy to add for the Java/C# backends.
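
As a purely hypothetical sketch of what such an interface could look
like (neither the text_encoding type nor the three-argument
io.open_input below exist in the current standard library; the names
are illustrative only):

    :- type text_encoding
        --->    utf8
        ;       utf16_le
        ;       iso_8859_1.

        % Open a named file for input, decoding its contents from the
        % given external encoding into Mercury's internal string
        % representation as characters are read.
        %
    :- pred io.open_input(string::in, text_encoding::in,
        io.res(io.text_input_stream)::out, io::di, io::uo) is det.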

For the C backends, there is iconv(3) on most platforms and probably
something equivalent on Windows. We would probably want a stream
abstraction similar to stdio FILEs that does the conversions
transparently, so io.m can assume all input/output is UTF-8.
[The "MR_NEW_MERCURYFILE_STRUCT" stuff may be of use, or not.]

For reading/writing ISO-8859-1 or UTF-16 only, we wouldn't really
need an external library.
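
For ISO-8859-1 in particular, every byte value is already the
corresponding Unicode code point, so it can be done today on top of
the binary input predicates. A minimal sketch, assuming the input
file name "File.csv" and silently stopping on read errors to keep it
short:

    :- module read_latin1.
    :- interface.
    :- import_module io.
    :- pred main(io::di, io::uo) is det.

    :- implementation.
    :- import_module char.
    :- import_module list.
    :- import_module string.

    main(!IO) :-
        io.open_binary_input("File.csv", OpenRes, !IO),
        (
            OpenRes = ok(Stream),
            read_latin1_chars(Stream, [], RevChars, !IO),
            io.close_binary_input(Stream, !IO),
            io.write_string(string.from_rev_char_list(RevChars), !IO)
        ;
            OpenRes = error(Error),
            io.write_string(io.error_message(Error), !IO),
            io.nl(!IO)
        ).

        % In ISO-8859-1 each byte value 0-255 is also the Unicode
        % code point, so each byte maps directly to a Mercury char.
        %
    :- pred read_latin1_chars(io.binary_input_stream::in,
        list(char)::in, list(char)::out, io::di, io::uo) is det.

    read_latin1_chars(Stream, !RevChars, !IO) :-
        io.read_byte(Stream, ByteRes, !IO),
        (
            ByteRes = ok(Byte),
            !:RevChars = [char.det_from_int(Byte) | !.RevChars],
            read_latin1_chars(Stream, !RevChars, !IO)
        ;
            ByteRes = eof
        ;
            ByteRes = error(_)   % stop on read errors in this sketch
        ).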

Peter

