<html dir="ltr"><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body class="" style="text-align:left; direction:ltr;"><div>Hi</div><div><br></div><div>I've imported some Unicode handling functions from the GLib C library. Works well.</div><div><br></div><div>Bye</div><div>Volker</div><div><br></div><div><br></div><div>On Monday, 27.06.2022 at 09:15 +0100, Sean Charles (emacstheviking) wrote:</div><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex">Hi Peter,<div class=""><br class=""></div><div class="">Thanks for the details; they make it all a lot clearer. Over the years, especially when I worked at an SMS company, my life was made hell by character conversions across hardware encoders and countries with different character sets.</div><div class="">It comes as no surprise then to hear what a hellish journey it would be to make this actually work the way I thought it would.</div><div class="">But... no matter, it was only me playing around with some features from the string library, that's all.</div><div class=""><br class=""></div><div class="">I recently started this as a learning exercise for myself and others and that was when I noticed.</div><div class=""><br class=""></div><div class=""><a href="https://github.com/emacstheviking/mercury-library-samples" class="">https://github.com/emacstheviking/mercury-library-samples</a></div><div class=""><br class=""></div><div class="">I am in the process of uploading a second DCG that is a little more refactored from the first sample.m and shows having main as cc_multi, as well as me pulling the finger out and refactoring in general with it.</div><div class=""><br class=""></div><div class="">Thanks again,</div><div class="">Sean.</div><div class=""><br class=""><div><br class=""><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex"><div class="">On 27 Jun 2022, at 08:41, Peter Wang &lt;<a 
href="mailto:novalazy@gmail.com" class="">novalazy@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">On Sun, 26 Jun 2022 13:29:21 +0100 "Sean Charles (emacstheviking)" &lt;<a href="mailto:objitsu@gmail.com" class="">objitsu@gmail.com</a>&gt; wrote:<br class=""><blockquote type="cite" style="margin:0 0 0 .8ex; border-left:2px #729fcf solid;padding-left:1ex">Hi,<br class=""><br class="">I just tried using the string.format_table function and it produces great output with simple code points, but when I added the smiley face the layout broke. Might it be a terminal issue? I am using iTerm2 on Monterey.<br class="">Is this the expected behaviour or is it an issue in the rendering code?<br class="">It feels like the extra code unit for the smiley's internal storage has not been taken into account when calculating the padding.<br class=""><br class="">I took a look at the source code for string.m, the pad_row() predicate, lines 5206 to 5243 of mercury-srcdist-rotd-22.01, but I soon became lost in my train of thought. Everything seemed to be using code points as the metric for calculating padding etc., so I couldn't really find anything wrong. That is assuming there is anything wrong, which I am not sure of yet, of course.<br class=""></blockquote><br class="">Hi,<br class=""><br class="">The Mercury standard library only has the barest understanding of<br class="">Unicode, so string.format_table is limited in what it can do.<br class="">It approximates the display width of a string by counting code points,<br class="">but that is incorrect in general. Only some code points occupy one<br class="">column in fixed-width output. But:<br class=""><br class=""> - some code points occupy 2 columns, e.g. 
zero-width space,<br class=""> combining characters<br class=""> - emoji sequences can be rendered to varying widths depending on<br class=""> software support<br class=""> - and more?<br class=""><br class="">string.format_table should actually segment sequences of code points into<br class="">"grapheme clusters" and measure the number of columns that each grapheme<br class="">cluster is expected to occupy. Furthermore, to handle right-to-left<br class="">scripts it would need to perform bidirectional text reordering as well.<br class=""><br class="">It would take a lot of supporting code and large data tables to<br class="">implement, and none of that exists in the Mercury standard library.<br class="">For a couple of reasons, I'm of the opinion that more extensive Unicode<br class="">support belongs in external libraries, but those libraries don't exist.<br class="">Sebastian Godelet once made a start here:<br class=""><a href="https://github.com/sebgod/mercury-unicode" class="">https://github.com/sebgod/mercury-unicode</a><br class=""><br class="">If you just needed basic emoji characters to be handled correctly,<br class="">and your C library has a wcwidth() function, you could make a version of<br class="">format_table that gets the width of each code point from wcwidth().<br class="">There are also some implementations of wcwidth() that exist outside<br class="">of standard C libraries.<br class=""><br class="">Peter<br class=""></div></div></blockquote></div><br class=""></div><pre>_______________________________________________</pre><pre>users mailing list</pre><a href="mailto:users@lists.mercurylang.org"><pre>users@lists.mercurylang.org</pre></a><pre><br></pre><a href="https://lists.mercurylang.org/listinfo/users"><pre>https://lists.mercurylang.org/listinfo/users</pre></a><pre><br></pre></blockquote></body></html>