[m-users.] Layout issue with function string.format_table/2 and emoticons in the source strings

Volker Wysk post at volker-wysk.de
Mon Jun 27 19:33:58 AEST 2022


Hi
I've imported some Unicode handling functions from the GLib C library. Works well.
Bye
Volker

On Monday, 27.06.2022 at 09:15 +0100, Sean Charles (emacstheviking) wrote:
> Hi Peter,
> Thanks for the details, it makes it all a lot clearer. Over the years, especially when I worked at an SMS company, my life was made hell by character conversions across hardware encoders and countries with different character sets.
> It comes as no surprise then to hear what a hellish journey it would be to make this actually work the way I thought it would.
> But... no matter, it was only me playing around with some features from the string library that's all.
> 
> I recently started this as a learning exercise for myself and others, and that was when I noticed the layout issue.
> 
> https://github.com/emacstheviking/mercury-library-samples
> 
> I am in the process of uploading a second DCG that is a little more refactored from the first sample.m and shows having main as cc_multi, as well as me pulling the finger out and refactoring in general with it.
> 
> Thanks again,
> Sean.
> 
> 
> > On 27 Jun 2022, at 08:41, Peter Wang <novalazy at gmail.com> wrote:
> > 
> > On Sun, 26 Jun 2022 13:29:21 +0100 "Sean Charles (emacstheviking)" <objitsu at gmail.com> wrote:
> > > Hi,
> > > 
> > > I just tried using the string.format_table function. It produces great output with simple code points, but when I added a smiley face the layout broke. Might it be a terminal issue? I am using iTerm2 on Monterey.
> > > Is this the expected behaviour or is it an issue in the rendering code?
> > > It feels like the extra code unit for the Smiley internal storage has not been taken into account when calculating the padding.
> > > 
> > > I took a look at the source code for string.m, the pad_row() predicate, lines 5206 to 5243 of mercury-srcdist-rotd-22.01, but I soon became lost in my train of thought. Everything seemed to be using code points as the metric for calculating padding etc., so I couldn't really find anything wrong, assuming there is anything wrong, which I am not sure of yet of course.
> > 
> > Hi,
> > 
> > The Mercury standard library only has the barest understanding of
> > Unicode, so string.format_table is limited in what it can do.
> > It approximates the display width of a string by counting code points,
> > but that is incorrect in general. Only some code points occupy one
> > column in fixed-width output. Others behave differently:
> > 
> >   - some code points occupy 2 columns, e.g. East Asian characters,
> >     some emoji
> >   - some code points occupy 0 columns, e.g. zero-width space,
> >     combining characters
> >   - Emoji Sequences can be rendered to varying widths depending on
> >     software support
> >   - and more?
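
To make the difference between counting code points and counting columns concrete, here is a minimal C sketch. The function names and the small range table are assumptions made for illustration. The table is deliberately incomplete and is no substitute for the Unicode East Asian Width data, and it looks at each code point on its own, so emoji sequences and grapheme clusters are not handled. It also assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative, incomplete width table (an assumption of this sketch). */
static int approx_width(uint32_t cp)
{
    /* Zero columns: combining diacritics, zero-width space and joiner,
     * variation selectors. */
    if ((cp >= 0x0300 && cp <= 0x036F) || cp == 0x200B || cp == 0x200D ||
        (cp >= 0xFE00 && cp <= 0xFE0F))
        return 0;
    /* Two columns: Hangul Jamo, common CJK ranges, Hangul syllables,
     * CJK compatibility ideographs, fullwidth forms, emoticons. */
    if ((cp >= 0x1100 && cp <= 0x115F) ||
        (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) ||
        (cp >= 0xAC00 && cp <= 0xD7A3) ||
        (cp >= 0xF900 && cp <= 0xFAFF) ||
        (cp >= 0xFF00 && cp <= 0xFF60) ||
        (cp >= 0x1F600 && cp <= 0x1F64F))
        return 2;
    /* Everything else is assumed to take one column. */
    return 1;
}

/* Decode one code point from UTF-8 at *p and advance *p past it. */
static uint32_t next_code_point(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    s++;
    /* Fold in continuation bytes; stop early at anything else (incl. NUL). */
    while (extra-- > 0 && (*s & 0xC0) == 0x80)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
    *p = s;
    return cp;
}

/* The metric described above: the number of code points. */
size_t count_code_points(const char *utf8)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t n = 0;

    assert(utf8 != NULL);
    while (*p != '\0') {
        (void)next_code_point(&p);
        n++;
    }
    return n;
}

/* A closer approximation: the sum of per-code-point column widths. */
size_t display_columns(const char *utf8)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t cols = 0;

    assert(utf8 != NULL);
    while (*p != '\0')
        cols += (size_t)approx_width(next_code_point(&p));
    return cols;
}
```

With this table, "a" followed by U+1F600 is 2 code points but 3 columns, and "e" followed by the combining acute accent U+0301 is 2 code points but 1 column. That is the kind of mismatch that throws off padding computed from code point counts.
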
> > 
> > string.format_table should actually segment sequences of code points to
> > "grapheme clusters" and measure the number of columns that each grapheme
> > cluster is expected to occupy. Furthermore, to handle right-to-left
> > scripts it would need to perform bidirectional text reordering as well.
> > 
> > It would take a lot of supporting code and large data tables to
> > implement, and none of that exists in the Mercury standard library.
> > For a couple of reasons, I'm of the opinion that more extensive Unicode
> > support belongs in external libraries, but those libraries don't exist.
> > Sebastian Godelet once made a start here:
> > https://github.com/sebgod/mercury-unicode
> > 
> > If you just needed basic emoji characters to be handled correctly,
> > and your C library has a wcwidth() function, you could make a version of
> > format_table that gets the width of each code point from wcwidth().
> > There are also some implementations of wcwidth() that exist outside
> > of standard C libraries.
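
The wcwidth() suggestion could look roughly like the following C sketch. wcwidth() and mbrtowc() are standard POSIX/C functions; columns_via_wcwidth and pad_cell are hypothetical helper names, not part of Mercury or of any library. The sketch assumes the program has already called setlocale(LC_ALL, "") under a UTF-8 locale. Without that, non-ASCII text will generally not be measured correctly.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* ask <wchar.h> to declare wcwidth() */
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Columns occupied by the multibyte string s in the current locale,
 * summing wcwidth() over its characters. */
int columns_via_wcwidth(const char *s)
{
    mbstate_t st;
    wchar_t wc;
    size_t len, n;
    int cols = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);
    len = strlen(s);
    while (len > 0) {
        n = mbrtowc(&wc, s, len, &st);
        if (n == 0)
            break;                      /* decoded a NUL: end of string */
        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Undecodable byte: count one column, reset state, skip it. */
            memset(&st, 0, sizeof st);
            cols += 1;
            n = 1;
        } else {
            int w = wcwidth(wc);
            /* wcwidth() returns -1 for non-printable or unknown
             * characters; this sketch assumes one column for those. */
            cols += (w < 0) ? 1 : w;
        }
        s += n;
        len -= n;
    }
    return cols;
}

/* Copy s into out, then append spaces until the cell fills `width`
 * columns. Output is truncated if out_size is too small. */
void pad_cell(char *out, size_t out_size, const char *s, int width)
{
    int cols = columns_via_wcwidth(s);
    int pad = (width > cols) ? width - cols : 0;

    snprintf(out, out_size, "%s%*s", s, pad, "");
}
```

A format_table variant could in principle call something like columns_via_wcwidth through Mercury's foreign language interface, in place of the code point count, when working out how much padding each cell needs. This still measures code points one at a time, so it does not address grapheme clusters or bidirectional text.
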
> > 
> > Peter
> 
> _______________________________________________
> users mailing list
> users at lists.mercurylang.org
> https://lists.mercurylang.org/listinfo/users