[m-users.] Layout issue with function string.format_table/2 and emoticons in the source strings

Peter Wang novalazy at gmail.com
Mon Jun 27 17:41:37 AEST 2022


On Sun, 26 Jun 2022 13:29:21 +0100 "Sean Charles (emacstheviking)" <objitsu at gmail.com> wrote:
> Hi,
> 
> I just tried using the string.format_table function and it produces great output with simple code points but when I added the Smiley face, the layout has broken but it might be a terminal issue? I am using iTerm2 on Monterey.
> Is this the expected behaviour or is it an issue in the rendering code?
> It feels like the extra code unit for the Smiley internal storage has not been taken into account when calculating the padding.
> 
> I took a look at the source code for string.m, the pad_row() predicate, lines 5206 to 5243 of mercury-srcdist-rotd-22.01 but I soon became lost in my train of thought, everything seemed to be using codepoints as the metric for calculating padding etc so I couldn't really find anything wrong. Assuming there is anything wrong which I am not sure of yet of course.

Hi,

The Mercury standard library only has the barest understanding of
Unicode, so string.format_table is limited in what it can do.
It approximates the display width of a string by counting code points,
but that is incorrect in general. Only some code points occupy exactly
one column in fixed-width output. Others do not:

  - some code points occupy 2 columns, e.g. East Asian characters,
    some emoji
  - some code points occupy 0 columns, e.g. zero-width space,
    combining characters
  - Emoji Sequences can be rendered to varying widths depending on
    software support
  - and more?
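
To make the cases above concrete, here is a small Python sketch (Python
rather than Mercury, only because its standard unicodedata module exposes
the relevant Unicode properties). The sample characters and the describe
helper are my own picks for illustration, not anything in the Mercury
library. It prints the properties that a width calculation would have to
consult, none of which a plain code point count can see.

```python
import unicodedata

# Illustrative sample characters (chosen for this sketch only).
samples = {
    "ASCII letter": "a",
    "CJK ideograph": "\u4e2d",
    "smiley emoji": "\U0001F600",
    "zero-width space": "\u200b",
    "combining acute": "\u0301",
}

def describe(ch):
    """Return (East Asian Width class, general category, combining class)."""
    return (
        unicodedata.east_asian_width(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

for label, ch in samples.items():
    # Each sample is exactly one code point, whatever its display width.
    print(f"{label:18} U+{ord(ch):04X} {describe(ch)}")
```

Each sample counts as one code point, yet the wide ("W") entries are
normally drawn in two columns and the format or combining ones in none.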

string.format_table should actually segment sequences of code points to
"grapheme clusters" and measure the number of columns that each grapheme
cluster is expected to occupy. Furthermore, to handle right-to-left
scripts it would need to perform bidirectional text reordering as well.

It would take a lot of supporting code and large data tables to
implement, and none of that exists in the Mercury standard library.
For a couple of reasons, I'm of the opinion that more extensive Unicode
support belongs in external libraries, but those libraries don't exist.
Sebastian Godelet once made a start here:
https://github.com/sebgod/mercury-unicode

If you just needed basic emoji characters to be handled correctly,
and your C library has a wcwidth() function, you could make a version of
format_table that gets the width of each code point from wcwidth().
There are also some implementations of wcwidth() that exist outside
of standard C libraries.
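
As a rough sketch of that idea, again in Python: approx_wcwidth below is
a hypothetical stand-in for wcwidth(), built from the unicodedata
properties, and format_rows is a made-up helper, not the Mercury
string.format_table API. It is a simplification: it treats every format
or nonspacing code point as zero width, does not return -1 for control
characters as a real wcwidth() would, and makes no attempt to handle
emoji sequences or grapheme clusters.

```python
import unicodedata

def approx_wcwidth(ch):
    """Approximate column count for one code point (assumption: not real wcwidth)."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # nonspacing marks, enclosing marks, format characters
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters, including many emoji
    return 1

def display_width(s):
    """Sum the per-code-point widths instead of counting code points."""
    return sum(approx_wcwidth(ch) for ch in s)

def pad_right(s, columns):
    """Pad with spaces until the estimated display width reaches columns."""
    return s + " " * max(0, columns - display_width(s))

def format_rows(rows, sep=" | "):
    """Left-align each column to the widest estimated cell width."""
    widths = [max(display_width(cell) for cell in col) for col in zip(*rows)]
    return [
        sep.join(pad_right(cell, w) for cell, w in zip(row, widths))
        for row in rows
    ]

if __name__ == "__main__":
    for line in format_rows([["name", "mood"], ["Sean", "\U0001F600"]]):
        print(line)
```

Whether the columns then line up on screen still depends on the
terminal agreeing with these width estimates.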

Peter

