[m-users.] String comparison when compiling to C

Zoltan Somogyi zoltan.somogyi at runbox.com
Sun Sep 1 04:52:14 AEST 2024



On Wed, 28 Aug 2024 21:44:11 +1200, "Richard O'Keefe" <raoknz at gmail.com> wrote:
> I knew *what* the difference was, and wanted a better handle on *why* it was.

I don't think either I, or anyone else, can give you that handle.
It was too long ago, and the people who made that decision
(probably Fergus Henderson, maybe Tom Conway) are no longer
with the Mercury project, so I doubt they would remember either.
Sorry.

We did need both lowest-possible-overhead less-than and greater-than
operations on builtin numeric types, and a generic compare operation
(essentially an Ord class that *all* types implicitly belong to) that necessarily
has higher overhead. I guess using < and > for the specific ops follows
the usual tradition in imperative languages, while using @< and @>
for the generic ops follows a more Prolog-like tradition.

> I used to teach a few classes about internationalisations,

By "a few classes", do you mean a lecture or two on i18n in a more general
subject, or a whole subject devoted to i18n? At Unimelb, we only ever
had the former.

> and am
> painfully aware that not only is there no "one size fits all" way to
> compare strings for all languages, there isn't even one general
> purpose way to compare strings for English.  Dictionary order, phone
> book order, and 'library order' as spelled out in Knuth, AoCP, v3 do
> not agree with each other, and whatever Linux uses to sort in ls(1)
> appears to disagree with all three; it certainly violates my
> expectations.

Agreed.

> Of course, this makes compare(<, 'Foo', 'foo') precisely as
> problematic as 'Foo' < 'foo'.

If your expectations are the same for a generic compare op as
for a specific one. For data structures such as maps, the exact way
that the comparison is done does not actually matter for *most* operations;
it is enough that builtin.compare obey the usual laws of comparison.
(Such as A < B and B < C implying A < C.)

The operations for which it *does* matter are the ones that convert
maps to or from sorted assoc_lists, and some similar ops. For those,
you may need to convert an assoc_list that is sorted using builtin.compare
to an assoc_list that is sorted by your chosen comparsion op, or vice versa.
However, I myself have not yet found any need for such code.

> ISO 14651 "International string ordering and comparison -- Method for
> comparing character strings and description of the common template
> tailorable ordering" does exist, of course.   That standard contains a
> useful tutorial section at the end (annex D) which cites Règles du
> classement alphabétique en langue française et procédure informatisée
> pour le tri, Alain LaBonté, Ministère des Communications du Québec,
> 1988-08-19, ISBN 2-550-19046-7 for guidance on how it should work.

Most people don't know or care that such things exist. I do know that
one of my year-mates at university told me later, when he worked in private
industry, that he tried to get a copy of an ISO standard, but was told by his boss
that the price they would need to pay for that standard was too high (several hundred
dollars). That can discourage people, and they may assume that they need pay for
all standards, even though some (like 14651) are now publicly available.

And I would think it is possible that "man strcoll" does not mention that standard
because (a) it did not yet exist when that man page was written, and (b)
for anyone who later came along who knew both the standard and its relationship
to strcoll, the burden of tracking down the maintainers of that man page, and
getting a patch accepted, was either assumed or proven to be too hard :-(

> I'm not sure that even Mercury programmers could *easily* code string
> comparison for themselves.  Back when 14651 was only an adorable
> little 150-page DIS puppy I tried implementing a fast single-pass
> algorithm for English only based on what it said, based on the idea of
> *noting* differences without *acting* on them until later.  If only
> ISO Latin 1 hadn't contained a bunch of 2-letter ligatures I'd have
> succeeded, but there I ran aground.  (Not that I couldn't do it at
> all, but that the fast single-pass method couldn't be made to work.)

I meant that they could easily implement an algorithm *once they decided
what algorithm they want*.

> So yes again "S1
> compared with S2" should *in general" be "S1 compared with S2 for
> purpose P in locale L" and you can't really use an infix operator for
> a 4-operand construct.

Of course. But a four-operand semidet predicate works.

> It's all very well to say that S1 < S2 is the wrong tool for many use
> cases.  So is compare(<, S1, S2) for exactly the same reasons.  For
> any use case where compare(<, S1, S2) *is* acceptable, so would S1 <
> S2 have been.

You *can* use @< and @> whereever you use builtin.compare.
They are actually implemented on top of builtin.compare.

In my reply to the original question, I talked about builtin.compare
because map.m and similar library modules need to do case analysis
over <, = and >, and for that, builtin.compare is more suitable than @< and @>.

Zoltan.



More information about the users mailing list