[m-users.] String comparison when compiling to C

Richard O'Keefe raoknz at gmail.com
Wed Aug 28 19:44:11 AEST 2024


I knew *what* the difference was, and wanted a better handle on *why* it was.
I used to teach a few classes about internationalisations, and am
painfully aware that not only is there no "one size fits all" way to
compare strings for all languages, there isn't even one general
purpose way to compare strings for English.  Dictionary order, phone
book order, and 'library order' as spelled out in Knuth, AoCP, v3 do
not agree with each other, and whatever Linux uses to sort in ls(1)
appears to disagree with all three; it certainly violates my
expectations.
Of course, this makes compare(<, 'Foo', 'foo') precisely as
problematic as 'Foo' < 'foo'.

ISO 14651 "International string ordering and comparison -- Method for
comparing character strings and description of the common template
tailorable ordering" does exist, of course.   That standard contains a
useful tutorial section at the end (annex D) which cites Règles du
classement alphabétique en langue française et procédure informatisée
pour le tri, Alain LaBonté, Ministère des Communications du Québec,
1988-08-19, ISBN 2-550-19046-7 for guidance on how it should work.

I'm not sure that even Mercury programmers could *easily* code string
comparison for themselves.  Back when 14651 was only an adorable
little 150-page DIS puppy I tried implementing a fast single-pass
algorithm for English only based on what it said, based on the idea of
*noting* differences without *acting* on them until later.  If only
ISO Latin 1 hadn't contained a bunch of 2-letter ligatures I'd have
succeeded, but there I ran aground.  (Not that I couldn't do it at
all, but that the fast single-pass method couldn't be made to work.)

I note that java.lang.String provides a locale-independent
String.compareTo() not entirely unlike C's strcmp() in spirit, with
deferring to java.text.Collator for locale-sensitive methods.  That's
not unreasonable.  What bothers me is that java.text.Collator provides
four "strengths" whereas the DIS that I was studying those years ago
named several existing languages that needed more, up to seven.  What
also bothers me is the statement that "RuleBasedCollator is designed
to be fully compliant to the Unicode Collation Algorithm (UCA) and
conforms to ISO 14651." but that the UCA and ISO 14651 disagree.  The
UCA calls for unicode normalisation and 14651 does not.  The led to
for example, strcoll() normalising in Windows but not in Linux or BSD.

So yes, 100% agreement that string comparison is an area where we have
difficult problems, made more difficult by things like 'man strcoll'
on Linux mentioning NEITHER UCA nor 14651.  And in my view, the
strcoll algorithm giving the WRONG order for files.  (I want foo.m <
foo.o < fool.m < fool.o < fooled.m < fooled.o,  What I get is fooled.m
< fooled.o < fool.m < fool.o  < foo.m <   foo.o, blatantly
contradicting  foo < fool < fooled .  Sigh.)   So yes again "S1
compared with S2" should *in general" be "S1 compared with S2 for
purpose P in locale L" and you can't really use an infix operator for
a 4-operand construct.

It's all very well to say that S1 < S2 is the wrong tool for many use
cases.  So is compare(<, S1, S2) for exactly the same reasons.  For
any use case where compare(<, S1, S2) *is* acceptable, so would S1 <
S2 have been.

One issue is that compare/3 </2 and friends are very nearly as
problematic for floats.

I like this comment:
  It's precisely because this problem space is truly gnarly and
requires ridiculous levels of intelligence and experience
  that it shouldn't be left to every individual developer to not only
address but also to even know they need to know it.
from this thread about string comparison in Swift (which made my head hurt):
 https://forums.swift.org/t/swift-string-comparison-doesnt-consider-ligatures-equivalent-to-their-components/66665/31

On Tue, 27 Aug 2024 at 23:27, Zoltan Somogyi <zoltan.somogyi at runbox.com> wrote:
>
>
> On Mon, 26 Aug 2024 12:49:42 +1200, "Richard O'Keefe" <raoknz at gmail.com> wrote:
> > It's not clear to me why S1 < S2 is any more or less type-specific
> > than compare(<, S1, S2).
>
> If you are talking in the abstract, the two notions are equivalent.
> If you are talking in the concrete case of Mercury, the two are diffferent.
>
> The compare predicate in the builtin module of the standard library
> has the signature
>
>   :- pred compare(comparison_result, T, T).
>
> and can thus compare pairs of values of any type. Its implementation
> uses runtime type information (RTTI), which has a bit of overhead.
>
> The standard library modules that define numerical types, which are
> int.m, uint.m, their 8-, 16-, 32- and 64-bit versions, rational.m and float.m,
> define predicates named <, >, =< and >= that each operate on values
> of the type that the module is named for. All of these predicates, with
> the exception of the ones in rational.m, are builtins, and are implemented
> with essentially no overhead.
>
> In our view, this gives users the freedom to choose what they value more;
> genericity, or performance.
>
> Note that while <, >, =< and >= are not generic operations, builtin.m
> does define predicates named @<, @>, @=< and @>=. These *are* generic,
> because they are implemented in terms of builtin.compare.
>
> This thread started because string.m does not define <. That may seem strange
> at first glance, but whatever comparison method we picked (case sensitive
> vs case insensitive, respecting white space vs ignoring white space, etc)
> would be right for some use cases but wrong for others. And in practice,
> many programs (I am not claiming all, or even most programs, but many)
> can simply do without such a predicate. This is because the most common
> use case for string comparison is using strings as keys in maps, and that
> is code that users don't have to write, because the Mercury standard library
> provides several modules (such as map.m) that already provide that capability.
> They do it by using builtin.compare, and for their purposes, it does not matter
> *which* comparison method that predicate uses.
>
> If someone needs a comparison predicate on strings that works in a given
> specific way, they are free to implement it themselves. It is not hard.
> Having seen some of the web pages that pass for documentation for the libraries
> of languages such as C# and Java, I wouldn't be surprised if it took less time
> for someone to implement a string comparison predicate in Mercury
> than it would take them to select the right method from a C# or Java class,
> *and convince themselves that it does what it says/implies it does* ,
> especially if one cares about what the comparison does in the presence
> of non-well-formed unicode strings :-(
>
> Zoltan.


More information about the users mailing list