[m-users.] Read text files encoded as Latin1 or UTF16-LE

Dirk Ziegemeyer dirk at ziegemeyer.de
Sun Oct 20 21:27:38 AEDT 2019


Hi,

> On 19.10.2019 at 02:58, Peter Wang <novalazy at gmail.com> wrote:
> 
> There is not yet any way to specify an encoding when opening streams.
> We would add a new version of io.open_input that takes an encoding
> argument. That should be fairly easy to add for the Java/C# backends.

Maybe also add a predicate to change the encoding of an already opened input stream? This could be useful when reading XML documents, where the encoding is specified in the XML declaration at the top of the file.

> For the C backends, there is iconv(3) on most platforms and probably
> something equivalent on Windows. Probably we would want a stream
> abstraction similar to stdio FILEs that does the conversions
> transparently, so io.m can assume all input/output is UTF-8.
> [The "MR_NEW_MERCURYFILE_STRUCT" stuff may be of use, or not.]
> 
> For reading/writing ISO-8859-1 or UTF-16 only, we wouldn't really
> need an external library.

I would prefer not to depend on any additional LGPL library in order to make it easier to comply with licence conditions for statically linked binaries on Windows.
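
To illustrate why ISO-8859-1 needs no external library: every Latin-1 byte maps directly to the Unicode code point with the same value, so the conversion to UTF-8 is a short loop. The following is only a minimal sketch in C. The function name latin1_to_utf8 and its buffer convention are hypothetical and are not part of the Mercury runtime.

```c
#include <stddef.h>

/*
 * Convert ISO-8859-1 bytes to UTF-8.
 * Hypothetical helper for illustration; not part of the Mercury runtime.
 * dst must have room for at least 2 * src_len bytes.
 * Returns the number of bytes written to dst.
 */
size_t latin1_to_utf8(const unsigned char *src, size_t src_len,
                      unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < src_len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            /* ASCII range: identical in both encodings. */
            dst[out++] = c;
        } else {
            /* U+0080..U+00FF: encoded as a two-byte UTF-8 sequence. */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Because the output is at most twice the size of the input, such a conversion could run chunk by chunk while reading, without a temporary file.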

I’m curious how other users read files that are not UTF-8 encoded, and whether there is a need to cover any other, more exotic encodings.

Covering ISO-8859-1 and UTF-16 would be enough for my use cases. I read large ISO-8859-1 encoded files from data sources that are not under my control. Not having to call iconv(1) would avoid creating large temporary files. I also use UTF-16-LE to read small Excel spreadsheets in Mercury that were saved from within Excel as "Unicode.txt". Saving as Unicode.txt turned out to always produce the same result, regardless of whether Excel runs on Mac or Windows and independent of the locale setting (English vs. European).
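
For the UTF-16-LE case, a decoder without an external library is also small. The sketch below, in C, assumes that the input may start with a byte order mark (FF FE), that unpaired surrogates should be replaced with U+FFFD, and that a trailing odd byte is dropped. The names put_utf8 and utf16le_to_utf8 are hypothetical and not part of the Mercury runtime.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8; returns the number of bytes written (1..4). */
static size_t put_utf8(uint32_t cp, unsigned char *dst)
{
    if (cp < 0x80) {
        dst[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = (unsigned char)(0xC0 | (cp >> 6));
        dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = (unsigned char)(0xE0 | (cp >> 12));
        dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = (unsigned char)(0xF0 | (cp >> 18));
    dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Convert UTF-16-LE bytes to UTF-8 (hypothetical helper for illustration).
 * dst must have room for at least 3 * (src_len / 2) bytes.
 * Returns the number of bytes written to dst.
 */
size_t utf16le_to_utf8(const unsigned char *src, size_t src_len,
                       unsigned char *dst)
{
    size_t i = 0, out = 0;

    /* Skip a leading little-endian byte order mark, if present. */
    if (src_len >= 2 && src[0] == 0xFF && src[1] == 0xFE) {
        i = 2;
    }
    while (i + 1 < src_len) {
        uint32_t cp = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            uint32_t lo = 0;
            if (i + 1 < src_len) {
                lo = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
            }
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD; /* unpaired high surrogate */
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* unpaired low surrogate */
        }
        out += put_utf8(cp, dst + out);
    }
    return out;
}
```

A real stream implementation would additionally have to handle code units and surrogate pairs that are split across read buffers, which this sketch does not attempt.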

Dirk

