[m-users.] Read text files encoded as Latin1 or UTF16-LE
Peter Wang
novalazy at gmail.com
Mon Oct 21 12:03:15 AEDT 2019
On Sun, 20 Oct 2019 12:27:38 +0200, Dirk Ziegemeyer <dirk at ziegemeyer.de> wrote:
> Hi,
>
> > Am 19.10.2019 um 02:58 schrieb Peter Wang <novalazy at gmail.com>:
> >
> > There is not yet any way to specify an encoding when opening streams.
> > We would add a new version of io.open_input that takes an encoding
> > argument. That should be fairly easy to add for the Java/C# backends.
>
> Maybe also add a predicate to change encoding of an already opened input stream? This might be useful when reading XML documents where encoding is specified in the XML declaration on top of the file.
>
Yes, that's an interesting use case. The tricky part is if there is any
buffering between the decoder and the user: the underlying stream may
have advanced further than the point the user has read up to.
With the Java/C# style APIs, you can stack buffering on as a later
stage, unlike my suggestion for io.open_input. I don't know how XML
encoding declarations are normally handled, though.
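For what it's worth, here is a minimal sketch of one way the C backend
could sidestep the buffering problem for the XML case. The function is
hypothetical, and it assumes the declaration is in an ASCII-compatible
encoding (a UTF-16 file would need BOM sniffing first):

    #include <sys/types.h>
    #include <unistd.h>

    /* Read the XML declaration one byte at a time, unbuffered, so
     * that the file offset afterwards is exactly the first byte
     * after "?>".  A decoder for the declared encoding can then take
     * over without losing any input.  Copies bytes from fd into buf
     * until "?>" is seen; returns the number of bytes consumed, or
     * -1 on error or overflow.  Parsing encoding="..." out of buf is
     * omitted for brevity. */
    ssize_t read_xml_decl(int fd, char *buf, size_t bufsize)
    {
        size_t n = 0;
        while (n < bufsize) {
            char c;
            if (read(fd, &c, 1) != 1)
                return -1;
            buf[n++] = c;
            if (n >= 2 && buf[n-2] == '?' && buf[n-1] == '>')
                return (ssize_t) n;
        }
        return -1;  /* declaration longer than buf */
    }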
> > For the C backends, there is iconv(3) on most platforms and probably
> > something equivalent on Windows. Probably we would want a stream
> > abstraction similar to stdio FILEs that does the conversions
> > transparently, so io.m can assume all input/output is UTF-8.
> > [The "MR_NEW_MERCURYFILE_STRUCT" stuff may be of use, or not.]
> >
> > For reading/writing ISO-8859-1 or UTF-16 only, we wouldn't really
> > need an external library.
>
> I would prefer not to depend on any additional LGPL library in order to make it easier to comply with licence conditions for statically linked binaries on Windows.
Yes, the GNU libiconv library is LGPL. On most platforms, iconv() is
included in the platform's standard C library. On Windows, we would need
an alternative iconv implementation or use the native API. All I can
find is MultiByteToWideChar, so in general we would need to convert to
UTF-16 as an intermediate step before converting to UTF-8.
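A rough sketch of that Windows route, using only the native API (the
function name is mine and the buffer handling is simplistic):

    #include <windows.h>

    /* Convert len Latin-1 bytes in src to UTF-8 in dst via a UTF-16
     * intermediate buffer.  28591 is the Windows code page for
     * ISO-8859-1.  Returns the number of UTF-8 bytes written, or 0
     * on failure.  A real implementation would call each conversion
     * with a NULL output buffer first to query the required size,
     * and would process the input in chunks. */
    int latin1_to_utf8_win32(const char *src, int len,
                             char *dst, int dstsize)
    {
        WCHAR wbuf[4096];
        int wlen;

        if (len > 4096)
            return 0;   /* sketch only: no chunking */
        wlen = MultiByteToWideChar(28591, 0, src, len, wbuf, 4096);
        if (wlen == 0)
            return 0;
        /* For CP_UTF8 the last two arguments must be NULL. */
        return WideCharToMultiByte(CP_UTF8, 0, wbuf, wlen,
                                   dst, dstsize, NULL, NULL);
    }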
>
> I’m curious how other users read files that are not UTF-8 encoded, and whether there is a need to cover other, more exotic encodings.
>
> Covering ISO-8859-1 and UTF-16 would be enough for my use cases. I read large ISO-8859-1 encoded files from data sources that are not under my control. Getting rid of the call to iconv(1) would avoid creating large temporary files. I use UTF-16-LE to read small Excel spreadsheets with Mercury that were saved from within Excel as "Unicode.txt". Saving as Unicode.txt turned out to always produce the same result, regardless of whether Excel was used on Mac or Windows, and independent of the locale setting (English vs. European).
For the CSV library, you could implement two type class instances:
one that reads bytes (from binary_input_streams) from ISO-8859-1 files
and returns them as chars, and another for UTF-16.
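The ISO-8859-1 instance would be trivial, since each Latin-1 byte is
already the corresponding Unicode code point, so converting a byte to
UTF-8 never takes more than two bytes. A minimal C sketch of the
underlying conversion (the function name is mine):

    /* out must have room for 2 bytes; returns the number of bytes
     * written. */
    int latin1_byte_to_utf8(unsigned char b, char *out)
    {
        if (b < 0x80) {
            out[0] = (char) b;          /* ASCII passes through */
            return 1;
        }
        out[0] = (char) (0xC0 | (b >> 6));    /* top two bits */
        out[1] = (char) (0x80 | (b & 0x3F));  /* low six bits */
        return 2;
    }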
Peter