[m-users.] String comparison when compiling to C

Richard O'Keefe raoknz at gmail.com
Fri Sep 6 21:29:10 AEST 2024


I meant a couple of lectures on I18N and L10N.  Sadly.
Things like "use library functions to read, write, and compute with dates."
(This got me in serious trouble for Islamophobia when I pointed out the
difficulties inherent in an observation-based calendar that doesn't agree
with growing seasons and suggested that the Creator would have to have
understood the Sun-Earth-Moon system better than that...)
Things like "use message catalogues for text that might need translating".
Things like "when to use strcmp and when and why to use strcoll".
Things like "use library functions to read and print money".

The Linux man page for strcoll was revised on 2017-09-15.
That's *after* the ISO standard came out.
The problem remains that
 - I have manual page sets for 4 commercial and 4 free UNIX systems
   not to mention copies of SUSv3 and SUSv4, and
 - NONE of them explains the ordering used by strcoll(3) for ANY
   locale other than the POSIX/C locale.
   Case in point: there are historically four ways to write Māori.
   * Maori (defective orthography, the oldest, devised by Europeans)
   * Maaori (double-vowel orthography, devised by Māori people)
   * Māori (macron orthography, devised by some official body or other so that
     vowel length could be indicated on old buildings &c by just
adding a bar instead
     of repainting or recarving the whole text)
  * Mäori (fake-font orthography, devised by the government so that
text could be
    claimed to be and be processed as ISO Latin 1, but with faked fonts could
    display correctly).
  In a Māori locale, you'd like as many as possible of these to
collate together.
  Do they?  There is no way other than experiment to find out.
  (The answer is no.  At least in the mi_NZ.UTF-8 locale in Ubuntu 22.04.)

This creates a problem for Mercury.  There is no portable framework
for string comparison
to plug into that lets people find out what localised comparison they
are getting.

I must say that no manual page that I've seen documents strcoll()
*well*.  For example,
before using strcoll(), I for one would like to know that it
*allocates* a transformed copy
of each string, compares the transformed copies, and then frees them.  (Checked:
Darwin and Illumos(Solaris).)  strxfrm() is definitely the way to go,
but of course that is
subject to the same documentation failures as strcoll().

As for <quote>
I meant that they could easily implement an algorithm *once they decided
what algorithm they want*.
</quote>
No.  It really is not easy.  Let's try an old favourite, telephone book order.
I still have a copy of the 2018 local phone book...
"Business names which begin with numbers or which have numbers included
will be sorted according to the alphabetic sequence of the word, not the number.
E.g., 1A Taxi Company will be sorted as OneA Taxi Company.  246 Business Systems
HAS A CHOICE as to where it is sorted.  It could be placed alphabetically as
'Two four Six Business Systems' or as "Two Hundred and Forty Six
Business Systems."
"In determining alphabetic order, the prefixes "M'", "Mac", and "Mc"
are treated as "Mac"
and the fourth letter determines the position of the entry."
Although it doesn't actually mean letter, "Mc Caw" is expanded to "Mac
Caw" and sorts
a long way before "McCall".
Just dealing with the Scots is easy enough.  But the numbers issue
creates the problem
that the sorting rules do not define an algorithm.  (There is, in
fact, a company in NZ
called 246.)  And neither alternative allows for "538 Foobar" <
"Fiveash" < "555 Zabbo"
which (modulo Foobar and Zabbo being my substitutes for street names)
do occur in that order.
Needless to say, "turn numbers into words" is not something that can
be expressed in the
localedef(1) framework.

When I was feeling mean, I used to set "implement phone book order" as
an assignment.
Reminding students to accept Māori names as well as English ones...

In all seriousness, defining text collation isn't easy and
implementing it is NOT easy.
Which is why it is important to have something suitable in a
programming language.
It's certainly not an area where one can reasonably expect the Mercury
maintainers
to have any expertise, so the best one could hope for would be
   String.collate(Order, String1, String2)  % = strcoll
   String.collate(Order, String1, String2, Locale) % = strcoll_l
   String.sort_key(String) % = strxfrm
   String.sort_key(String, Locale) % = strxfrm_l, missing from Ubuntu 22
and the failure to document what you get wouldn't be the fault of *Mercury*.

I can see how to do something like this for C and Java.  What other
back-ends should
be catered for for something like this to be accepted in Mercury?  C#?
 Ada?  JavaScript?

 String.collate/3 or
perhaps better String.sort_key(String[, Locale])



On Sun, 1 Sept 2024 at 06:52, Zoltan Somogyi <zoltan.somogyi at runbox.com> wrote:
>
>
>
> On Wed, 28 Aug 2024 21:44:11 +1200, "Richard O'Keefe" <raoknz at gmail.com> wrote:
> > I knew *what* the difference was, and wanted a better handle on *why* it was.
>
> I don't think either I, or anyone else, can give you that handle.
> It was too long ago, and the people who made that decision
> (probably Fergus Henderson, maybe Tom Conway) are no longer
> with the Mercury project, so I doubt they would remember either.
> Sorry.
>
> We did need both lowest-possible-overhead less-than and greater-than
> operations on builtin numeric types, and a generic compare operation
> (essentially an Ord class that *all* types implicitly belong to) that necessarily
> has higher overhead. I guess using < and > for the specific ops follows
> the usual tradition in imperative languages, while using @< and @>
> for the generic ops follows a more Prolog-like tradition.
>
> > I used to teach a few classes about internationalisations,
>
> By "a few classes", do you mean a lecture or two on i18n in a more general
> subject, or a whole subject devoted to i18n? At Unimelb, we only ever
> had the former.
>
> > and am
> > painfully aware that not only is there no "one size fits all" way to
> > compare strings for all languages, there isn't even one general
> > purpose way to compare strings for English.  Dictionary order, phone
> > book order, and 'library order' as spelled out in Knuth, AoCP, v3 do
> > not agree with each other, and whatever Linux uses to sort in ls(1)
> > appears to disagree with all three; it certainly violates my
> > expectations.
>
> Agreed.
>
> > Of course, this makes compare(<, 'Foo', 'foo') precisely as
> > problematic as 'Foo' < 'foo'.
>
> If your expectations are the same for a generic compare op as
> for a specific one. For data structures such as maps, the exact way
> that the comparison is done does not actually matter for *most* operations;
> it is enough that builtin.compare obey the usual laws of comparison.
> (Such as A < B and B < C implying A < C.)
>
> The operations for which it *does* matter are the ones that convert
> maps to or from sorted assoc_lists, and some similar ops. For those,
> you may need to convert an assoc_list that is sorted using builtin.compare
> to an assoc_list that is sorted by your chosen comparsion op, or vice versa.
> However, I myself have not yet found any need for such code.
>
> > ISO 14651 "International string ordering and comparison -- Method for
> > comparing character strings and description of the common template
> > tailorable ordering" does exist, of course.   That standard contains a
> > useful tutorial section at the end (annex D) which cites Règles du
> > classement alphabétique en langue française et procédure informatisée
> > pour le tri, Alain LaBonté, Ministère des Communications du Québec,
> > 1988-08-19, ISBN 2-550-19046-7 for guidance on how it should work.
>
> Most people don't know or care that such things exist. I do know that
> one of my year-mates at university told me later, when he worked in private
> industry, that he tried to get a copy of an ISO standard, but was told by his boss
> that the price they would need to pay for that standard was too high (several hundred
> dollars). That can discourage people, and they may assume that they need pay for
> all standards, even though some (like 14651) are now publicly available.
>
> And I would think it is possible that "man strcoll" does not mention that standard
> because (a) it did not yet exist when that man page was written, and (b)
> for anyone who later came along who knew both the standard and its relationship
> to strcoll, the burden of tracking down the maintainers of that man page, and
> getting a patch accepted, was either assumed or proven to be too hard :-(
>
> > I'm not sure that even Mercury programmers could *easily* code string
> > comparison for themselves.  Back when 14651 was only an adorable
> > little 150-page DIS puppy I tried implementing a fast single-pass
> > algorithm for English only based on what it said, based on the idea of
> > *noting* differences without *acting* on them until later.  If only
> > ISO Latin 1 hadn't contained a bunch of 2-letter ligatures I'd have
> > succeeded, but there I ran aground.  (Not that I couldn't do it at
> > all, but that the fast single-pass method couldn't be made to work.)
>
> I meant that they could easily implement an algorithm *once they decided
> what algorithm they want*.
>
> > So yes again "S1
> > compared with S2" should *in general" be "S1 compared with S2 for
> > purpose P in locale L" and you can't really use an infix operator for
> > a 4-operand construct.
>
> Of course. But a four-operand semidet predicate works.
>
> > It's all very well to say that S1 < S2 is the wrong tool for many use
> > cases.  So is compare(<, S1, S2) for exactly the same reasons.  For
> > any use case where compare(<, S1, S2) *is* acceptable, so would S1 <
> > S2 have been.
>
> You *can* use @< and @> whereever you use builtin.compare.
> They are actually implemented on top of builtin.compare.
>
> In my reply to the original question, I talked about builtin.compare
> because map.m and similar library modules need to do case analysis
> over <, = and >, and for that, builtin.compare is more suitable than @< and @>.
>
> Zoltan.
>


More information about the users mailing list