[m-rev.] for review: read and write bitmaps in bitmap.m

Zoltan Somogyi zoltan.somogyi at runbox.com
Tue Mar 8 19:32:37 AEDT 2022


2022-03-07 11:49 GMT+11:00 "Julien Fischer" <jfischer at opturion.com>:
>> I think that some of those failures may be due to an incorrectly-coded
>> sanity test in parse_module.m, on line 508, which insists on the number
>> of code units in the file being the same as its string length, but the length
>> counts code *points*, not units.
> 
> No, string.length counts code units.

Yes, you are right; I misread the documentation.

> There appears to be a problem with
> the C# version of io.read_file_as_string_and_num_code_units when the
> file contains non-BMP(?) code points (e.g. the emoji in
> tests/hard_coded/char_to_string.m).

Actually, I think there are at least two problems :-(

I ran a bootcheck in C# in which I replaced that sanity check
with one that printed the two mismatched values. The relevant part
of the bootcheck output is attached.

What strikes me about that output is that when you get these mismatches,
*neither* value is right! For example, for hard_coded/string_append_pieces,
which has five big smiley emojis, the mismatch is

NumCodeUnits = 1515, FileStringLen = 1520

but according to both ls -l and wc, that module's source
actually contains 1530 bytes. I haven't checked all the mismatches,
but for the ones I did check, the story was the same: the disagreements
are between two numbers that are both wrong.
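
A minimal Python sketch of one way to read these three numbers, assuming
(hypothetically) that the file is ASCII apart from five emoji that lie outside
the BMP. The same text then has three different lengths depending on the unit
being counted. The 1510 filler characters are an assumption chosen to reproduce
the figures above; they are not the actual contents of the test module.

```python
# Hypothetical stand-in for the module source: 1510 ASCII characters
# plus five non-BMP emoji (U+1F600). This is NOT the real file content.
text = "a" * 1510 + "\U0001F600" * 5

def utf8_bytes(s):
    # What ls -l and wc -c report for a UTF-8 encoded file.
    return len(s.encode("utf-8"))

def utf16_code_units(s):
    # Each UTF-16 code unit is two bytes; non-BMP code points take two units.
    return len(s.encode("utf-16-le")) // 2

def code_points(s):
    # Python's len() on a str counts code points.
    return len(s)

print(utf8_bytes(text))        # 1530
print(utf16_code_units(text))  # 1520
print(code_points(text))       # 1515
```

Under that assumption, 1515 would be a count of code points, 1520 a count of
UTF-16 code units, and 1530 a count of UTF-8 bytes.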

So not only do string.length and read_file_as_string_and_num_code_units
both have bugs (which means that deleting the sanity check would not help),
they have different bugs.

Given that string.length's C# definition is "Str.Length", the first bug looks like
it may not be ours to fix, though it could still be our fault if we passed the wrong
string as Str. I will do another C# bootcheck tonight with a compiler modified
to print out the entire contents, as returned by read_file_as_string_and_num_code_units,
to check that hypothesis.

The second bug seems to be ours. read_file_as_string_and_num_code_units
has three separate implementations. For Java, a foreign_proc invokes
a Java method in a handwritten foreign_code block. For C and C#,
we read the file's contents using (possibly repeated) calls to read_into_buffer.
For C, this is implemented using a foreign_proc. For C#, it is done by Mercury
code that reads into the array that constitutes the buffer. This code, read_into_array,
has a bug, in that it repeatedly calls read_char_code, which is documented to read
a code POINT, and for each code point, it increments !Pos by 1. This is understandable,
in that the buffer is an array in which each slot contains a code point, but it conflicts
with the fact that its caller interprets the final Pos as a count of code UNITs,
because that is what the C version returns.
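
The mismatch described here can be sketched in Python. This is only an
illustration, not the Mercury code: read_into_array_sketch and
utf16_units_for are hypothetical names, and the sketch assumes the target
string representation is UTF-16, as it is for C#.

```python
def read_into_array_sketch(text):
    """Store one code point per slot, bumping pos by 1 per code point."""
    buffer = []
    pos = 0
    for ch in text:          # Python iterates over code points
        buffer.append(ord(ch))
        pos += 1             # counts code POINTS, not code units
    return buffer, pos

def utf16_units_for(code_point):
    # Code points above U+FFFF need a surrogate pair in UTF-16.
    return 2 if code_point > 0xFFFF else 1

buffer, pos = read_into_array_sketch("ab\U0001F600")
units = sum(utf16_units_for(cp) for cp in buffer)
print(pos, units)  # 3 4
```

A caller that treats the final pos as a code unit count will be off by one
for every non-BMP code point in the input.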

If we move the job of buffer_to_string into read_file_as_string_loop,
we could have one implementation in C that uses only code units, and
one in Mercury for C# that uses code points everywhere until just before returning,
when it uses string.length to count the code units in the final string.
That would probably eliminate the inconsistency caught by the
sanity check. But then, I am just a novice at Unicode, and my attempts
to find out the exact semantics of Str.Length in C# foundered on the fact
that Google finds pretty much only web pages that spend fifty lines of text
and several diagrams to tell you that "Str.Length returns the length of Str" :-(
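
A rough Python sketch of the second half of that idea: work in code points
until the end, then derive the code unit count from the final string. The
function names are hypothetical, and the sketch assumes that the final string
is UTF-16 and that its length is measured in UTF-16 code units (the .NET
documentation for String.Length describes it as the number of Char objects,
not the number of Unicode characters).

```python
def utf16_length(s):
    # Number of UTF-16 code units needed to represent s.
    return len(s.encode("utf-16-le")) // 2

def read_as_string_and_num_code_units_sketch(code_points):
    """Build the string from code points, then count code units once."""
    text = "".join(chr(cp) for cp in code_points)
    # The count comes from the finished string, not from a running
    # position, so the two can no longer disagree.
    return text, utf16_length(text)

text, num_code_units = read_as_string_and_num_code_units_sketch(
    [0x61, 0x1F600, 0x62])
print(len(text), num_code_units)  # 3 4
```

Deriving the count from the finished string means the returned number and the
string's length are the same quantity by construction.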

What do you think?

Zoltan.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NCU
Type: application/octet-stream
Size: 4677 bytes
Desc: not available
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20220308/a4cb812b/attachment-0001.obj>

