[m-users.] Layout issue with function string.format_table/2 and emoticons in the source strings

Sean Charles (emacstheviking) objitsu at gmail.com
Mon Jun 27 18:15:10 AEST 2022


Hi Peter,

Thanks for the details; they make it all a lot clearer. Over the years, especially when I worked at an SMS company, my life was made hell by character conversions across hardware encoders and countries with different character sets.
So it comes as no surprise to hear what a hellish journey it would be to make this actually work the way I thought it would.
But no matter, I was only playing around with some features from the string library, that's all.

I recently started the repository below as a learning exercise for myself and others, and that was when I noticed the layout problem.

https://github.com/emacstheviking/mercury-library-samples

I am in the process of uploading a second DCG example that is a little more refactored than the first sample.m and shows having main as cc_multi, as well as me pulling my finger out and refactoring it in general.

Thanks again,
Sean.


> On 27 Jun 2022, at 08:41, Peter Wang <novalazy at gmail.com> wrote:
> 
> On Sun, 26 Jun 2022 13:29:21 +0100 "Sean Charles (emacstheviking)" <objitsu at gmail.com> wrote:
>> Hi,
>> 
>> I just tried using the string.format_table function and it produces great output with simple code points, but when I added a smiley face the layout broke. Might it be a terminal issue? I am using iTerm2 on Monterey.
>> Is this the expected behaviour or is it an issue in the rendering code?
>> It feels like the extra code unit in the smiley's internal storage has not been taken into account when calculating the padding.
>> 
>> I took a look at the source code for string.m (the pad_row() predicate, lines 5206 to 5243 of mercury-srcdist-rotd-22.01), but I soon lost my train of thought. Everything seemed to be using code points as the metric for calculating padding etc., so I couldn't really find anything wrong, assuming there is anything wrong, which I am not sure of yet of course.
> 
> Hi,
> 
> The Mercury standard library only has the barest understanding of
> Unicode, so string.format_table is limited in what it can do.
> It approximates the display width of a string by counting code points,
> but that is incorrect in general. Only some code points occupy one
> column in fixed-width output; others do not:
> 
>  - some code points occupy 2 columns, e.g. East Asian characters,
>    some emoji
>  - some code points occupy 0 columns, e.g. zero-width space,
>    combining characters
>  - Emoji Sequences can be rendered to varying widths depending on
>    software support
>  - and more?
> 
> string.format_table should actually segment sequences of code points to
> "grapheme clusters" and measure the number of columns that each grapheme
> cluster is expected to occupy. Furthermore, to handle right-to-left
> scripts it would need to perform bidirectional text reordering as well.
> 
> It would take a lot of supporting code and large data tables to
> implement, and none of that exists in the Mercury standard library.
> For a couple of reasons, I'm of the opinion that more extensive Unicode
> support belongs in external libraries, but those libraries don't exist.
> Sebastian Godelet once made a start here:
> https://github.com/sebgod/mercury-unicode
> 
> If you just needed basic emoji characters to be handled correctly,
> and your C library has a wcwidth() function, you could make a version of
> format_table that gets the width of each code point from wcwidth().
> There are also some implementations of wcwidth() that exist outside
> of standard C libraries.
> 
> Peter
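
As a rough illustration of the per-code-point width idea Peter describes above, here is a minimal sketch in C. It does not use the Mercury library or the real wcwidth(); the function names (utf8_decode, approx_width, count_code_points, display_columns) are hypothetical, and the width table covers only a handful of ranges, so treat it as an assumption-laden toy rather than a real implementation. It only shows why counting code points and counting columns give different padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s into *cp and return the bytes consumed.
 * Simplification: malformed input is passed through one byte at a time. */
static size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;

    if (p[0] < 0x80) {
        *cp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80 &&
        (p[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(p[0] & 0x0F) << 12) |
              ((uint32_t)(p[1] & 0x3F) << 6) | (uint32_t)(p[2] & 0x3F);
        return 3;
    }
    if ((p[0] & 0xF8) == 0xF0 && (p[1] & 0xC0) == 0x80 &&
        (p[2] & 0xC0) == 0x80 && (p[3] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(p[0] & 0x07) << 18) |
              ((uint32_t)(p[1] & 0x3F) << 12) |
              ((uint32_t)(p[2] & 0x3F) << 6) | (uint32_t)(p[3] & 0x3F);
        return 4;
    }
    *cp = p[0];
    return 1;
}

/* Hypothetical, deliberately incomplete stand-in for wcwidth().
 * A real table is far larger; these ranges are examples only. */
static int approx_width(uint32_t cp)
{
    if (cp == 0x200B) return 0;                    /* zero-width space */
    if (cp >= 0x0300 && cp <= 0x036F) return 0;    /* combining diacritics */
    if (cp >= 0x4E00 && cp <= 0x9FFF) return 2;    /* CJK unified ideographs */
    if (cp >= 0x1F600 && cp <= 0x1F64F) return 2;  /* emoticons block */
    return 1;                                      /* everything else */
}

/* Number of code points in a NUL-terminated UTF-8 string. */
size_t count_code_points(const char *s)
{
    size_t n = 0;
    uint32_t cp;

    while (*s != '\0') {
        s += utf8_decode(s, &cp);
        n++;
    }
    return n;
}

/* Estimated number of terminal columns, summing per-code-point widths. */
int display_columns(const char *s)
{
    int cols = 0;
    uint32_t cp;

    while (*s != '\0') {
        s += utf8_decode(s, &cp);
        cols += approx_width(cp);
    }
    return cols;
}
```

With this sketch, a cell holding a single smiley (U+1F600) counts as one code point but two columns, so padding computed as the column width minus display_columns(cell) would use one space fewer than padding based on the code point count. Swapping approx_width() for the C library's wcwidth() should give better coverage, though its results depend on the current locale, and neither approach handles grapheme clusters or emoji sequences.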
