[m-rev.] for review: read and write bitmaps in bitmap.m

Peter Wang novalazy at gmail.com
Wed Mar 9 10:45:38 AEDT 2022

On Wed, 09 Mar 2022 09:46:39 +1100 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> 
> 2022-03-08 19:32 GMT+11:00 "Zoltan Somogyi" <zoltan.somogyi at runbox.com>:
> > Given that string.length's C# definition is "Str.Length", the first bug looks like
> > it may be not ours to fix, though it could still be our fault if we passed the wrong
> > string as Str. I will do another C# bootcheck tonight with a compiler modified
> > to print out the entire contents, as returned by read_file_as_string_and_num_code_units,
> > to check that hypothesis.
> 
> After both Julien's fixes and mine, the bootcheck has finished; the cut-down results
> are attached. For both the tests I checked, the read-in contents of the file
> match the actual contents. So we have a UTF-8 encoded file containing
> (in one case) 1530 bytes, which we have read into a C# string that (when printed out)
> reproduces those 1530 bytes exactly, but for which Str.Length returns only 1520.
> I am suspecting that Str.Length counts code points, not code units, but as I mentioned
> in my earlier email, I cannot find real info on this in the sea of useless dreck
> on the web.

C# String.Length counts UTF-16 code units.

https://docs.microsoft.com/en-us/dotnet/api/system.string.length?view=net-6.0
"The Length property returns the number of Char objects in this
instance, not the number of Unicode characters."

https://docs.microsoft.com/en-us/dotnet/api/system.char?view=net-6.0
The Char type "Represents a character as a UTF-16 code unit."


string_append_pieces.m consists of 1515 code points.
It has a length of 1530 bytes (= UTF-8 code units).
Each ASCII character in the file takes 1 UTF-8 code unit.
The 5 U+1F600 GRINNING FACE characters take 4 UTF-8 code units each.
Total: 1510 + 5 * 4 = 1530 UTF-8 code units

When converted to UTF-16, string_append_pieces.m takes
1520 UTF-16 code units.
Each ASCII character takes 1 UTF-16 code unit.
The 5 U+1F600 GRINNING FACE characters take 2 UTF-16 code units each.
Total: 1510 + 5 * 2 = 1520 UTF-16 code units, or 3040 bytes
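
The same arithmetic can be checked with a short Python sketch. It does not
read string_append_pieces.m; it assumes a synthetic stand-in string with
the same makeup (1510 ASCII characters plus 5 copies of U+1F600), and the
variable names are only illustrative.

```python
# Synthetic stand-in for the file (an assumption, not the real contents):
# 1510 ASCII characters plus 5 copies of U+1F600 GRINNING FACE.
text = "a" * 1510 + "\U0001F600" * 5

# Python's len() on a str counts code points.
code_points = len(text)

# One byte per UTF-8 code unit.
utf8_units = len(text.encode("utf-8"))

# An explicit byte order ("utf-16-be") means no Byte Order Mark is emitted.
utf16_bytes = len(text.encode("utf-16-be"))

# Two bytes per UTF-16 code unit; this is what C# String.Length reports.
utf16_units = utf16_bytes // 2

print(code_points, utf8_units, utf16_units, utf16_bytes)
# prints: 1515 1530 1520 3040
```

The 10-unit gap between 1530 and 1520 comes entirely from the 5 non-BMP
characters: each one is 4 UTF-8 code units but only 2 UTF-16 code units.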

As confirmation:

    % iconv -f UTF-8 -t UTF-16BE < string_append_pieces.m | wc -c
    3040

(Note: we convert to "UTF-16BE" because if we told iconv to convert to
"UTF-16", it would add a Byte Order Mark to the start of the
output, which is one extra UTF-16 code unit.)
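
Python's codecs show the same distinction, which may be a convenient way
to see the effect without iconv. This is only an analogy: it uses the same
synthetic stand-in string as an assumption and says nothing about how any
particular iconv implementation behaves.

```python
# Synthetic stand-in for the file (an assumption, not the real contents).
text = "a" * 1510 + "\U0001F600" * 5

# Python's generic "utf-16" codec prepends a Byte Order Mark.
with_bom = text.encode("utf-16")

# Naming the byte order explicitly suppresses the Byte Order Mark.
without_bom = text.encode("utf-16-be")

# The BOM is U+FEFF, serialized in the platform's native byte order.
has_bom = with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")

print(len(with_bom), len(without_bom), has_bom)
# prints: 3042 3040 True
```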

Peter
