[m-users.] Announcement (aggregates module) + questions (window functions)

Richard O'Keefe raoknz at gmail.com
Tue Mar 7 12:34:42 AEDT 2023
Previous message: [m-users.] Announcement (aggregates module) + questions (window functions)
Next message: [m-users.] Existential types question
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Oh dear.  NEVER defer to experts.  Listen to experts,
use your own judgement, and measure, measure, measure.

Consider some like (Prolog):
  :- p(X, a, Z),
     q(Z, W),
     r(W, X, Z).

This, I take it, is what you mean when you refer to
"non-determinism".  It appears to be a three-nested-
loop search.  But it might not be!  It depends on
what level of indexing the Prolog system offers.
The call to p/3 *might* be a simple lookup;
the call to q/2 *might* also be a simple lookup
(given Z), and then the call to r/3 might be a search.
It's exactly the same with joins in SQL.  A join
*might* be a big search, it *might* be a merge, it
*might* be a linear time hashed join, or it might
even about to a single lookup.  That's actually the
basic point of SQL.  Divide data base design into
(a) setting up a "language" in which every query
you might need is expressible, then (b) choosing an
indexing structure in which the queries you DO turn
out to need are acceptably efficient.

This is a "separation of concerns".  First worry
about correctness, then worry about efficiency.
This has implications for language design (it has
to be POSSIBLE to add appropriate indexing support)
and sad to say, this is one area where many Prolog
systems did not do well.  It also has implications
for readability, both good and bad.  The good
implication is that if you are trying to see what
a code fragment computes, it isn't cluttered up with
implementation detail.  The bad implication is that
the design work to make the code efficient is not
*there* in the code, it's somewhere else in the
index declarations, so if you want to understand HOW
the code performs acceptably, you have more reading
work to do.

I'll skip to the final point.
"When is the elegance of a non-deterministic solution acceptable given its
inefficiency?"
Answer 1:  WHEN YOU ACCEPT IT.  That's really the only
answer there is.  It's acceptable when it satisfies YOUR
criteria for acceptability.  It depends on what you need
that code to be/do.  Fast enough is fast enough.

Answer 2:  THERE IS A FALSE PRESUPPOSITION IN THE QUESTION.
"given its inefficiency"?  I've just explained that the
efficiency of a query in Prolog or SQL depends on what the
indexing support is.  The very same query might be very
efficient or very inefficient.  In the case of Mercury,
the compiler tries to reorder your code -- much like an
SQL optimiser would -- to improve its efficiency.
DEC-10 Prolog and its faithful successors came with a
source code rewriting feature you could use (admittedly
with much effort) to replace a clear but inefficient
subquery with whatever you thought would be better.
(InterLISP had, no, has, Medley is open source now, a
similar feature.  Erlang has a similar feature which IS
used for query optimisation.)  For that matter, there
seems to be a presupposition in the question that
deterministic code is efficient.  That is not necessarily
the case.  Consider the classic example of computing
Fibonacci numbers.  The obvious algorithm for F(n) takes
O(2**n) time.  A straightforward algorithm takes O(n)
time.  A matrix algorithm (which is perfectly obvious once
it has been explained to you) takes O(log n) time.  All
three algorithms are deterministic.

So:
- be clear about what your problem is and how efficient
  it needs to be
- write a simple program to solve the problem and play
  with some test cases to make sure that your solution
  is correct
- profile the program on larger examples to see where
  the shoe pinches
- add declarations to help the compiler do better ...
- add supporting indices or other data structures as
  needed
- keep checking that your improvements don't break anything
- it's acceptable when you accept it.

Just today I read this post
https://cohost.org/mononcqc/post/1128460-on-the-hunt-for-a-bu
about a small change to some code improving the throughput
of the program from 50 kB/sec to 30 MB/sec (in Erlang).
At a general level, we can say that the author
- Recognised a performance problem
- Used profiling tools to characterise it better
- Understood his (functional!) language well enough
  to understand what might be causing it
- did an experiment to check the hypothesis.

On Sun, 5 Mar 2023 at 04:26, Mark Clements <mark.clements at ki.se> wrote:

> Zoltan and Richard,
>
> I drafted a response a few months ago that I never sent: given that my
> main field is not computer science, I deferred to you.
>
> As a follow-up: when should one use non-determinism and when should one
> use Mercury? To motivate my questions: I came to Mercury because I was
> looking for a typed "Prolog" that allowed for elegant joins between
> relationships (essentially to replace SQL:). Zoltan's implementation for
> the CSV example is canonical and efficient code - but it is also
> comparatively long and uses several data structures that could be replaced
> by less efficient non-determinism. When is the elegance of a
> non-deterministic solution acceptable given its inefficiency?
>
> Sincerely, Mark.
>
> ------------------------------
> *From:* Richard O'Keefe <raoknz at gmail.com>
> *Sent:* 31 December 2022 04:57
> *To:* zoltan.somogyi at runbox.com <zoltan.somogyi at runbox.com>
> *Cc:* Mark Clements <mark.clements at ki.se>; users <
> users at lists.mercurylang.org>
> *Subject:* Re: [m-users.] Announcement (aggregates module) + questions
> (window functions)
>
> You don't often get email from raoknz at gmail.com. Learn why this is
> important <https://aka.ms/LearnAboutSenderIdentification>
> I'd like to second Zoltan's point.
> I've solved that particular Rosetta Code problem in two different
> programming languages, using an imperative style with hash tables,
> and modulo the usual hand-waving about hash-tables being "O(1)"
> -- which is fair enough in this case -- the code is obviously
> linear in the size of the two CSV files the program has to read.
>
> I'm reminded of my experience with XPath, XQuery, and XSLT, that
> sometimes it is simpler and more readable NOT to use a specialised
> query language but to use plain high level code in a language with
> decent facilities.  (Such as Mercury.)  Not coincidentally, XPath,
> XQuery, and XSLT hide practically everything you need to understand
> performance.
>
>
> On Sat, 31 Dec 2022 at 08:35, Zoltan Somogyi <zoltan.somogyi at runbox.com>
> wrote:
>
>
> 2022-12-30 00:27 GMT+11:00 "Mark Clements" <mark.clements at ki.se>:
> > %% patient and visit are nondet predicates
> > main(!IO) :-
> >     print_line("{Id, RowNumber, Date, Score, CumScore}", !IO),
> >     aggregate((pred({Id,RowNumber,Datei,Scorei,CumScorei}::out) is
> nondet :-
> >                    patient(Id,_),
> >                    Combined = (pred(Date::out,Score::out) is nondet :-
> visit(Id,Date,Score)),
> >                    Combined(Datei,Scorei),
> >                    bag_cum_sum(Combined)(Datei,CumScorei),
> >                    Dates = (pred(Date::out) is nondet :-
> Combined(Date,_)),
> >                    bag_row_number(Dates)(Datei,RowNumber)),
> >               print_line,
> >               !IO).
> >
> > Third, I have sought to stay with nondet predicates, with the
> implementation internally using lists -- is there a better approach?
>
> I think that getting your data from nondet predicates is fundamentally a
> bad idea.
> The reason is simple: it bakes the data into the program. If you want to
> run
> the same task on a different data set, you have to modify the program and
> recompile it.
> This is much less convenient than a program that you can run on a
> different data set
> simply by invoking it with different file names.
>
> The attached code is my solution to the same task on rosettacode.org
> <https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Frosettacode.org%2F&data=05%7C01%7Cmark.clements%40ki.se%7C585dbab45ec34f7f25ef08daeae32777%7Cbff7eef1cf4b4f32be3da1dda043c05d%7C0%7C0%7C638080559250227740%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=tHDpCagECzIIv%2FsnHGSobrwGJS4wgMOfxMwkUswQ%2BBo%3D&reserved=0>
> .
> You will note that its operative part is longer than your code above,
> but it is also much simpler, and therefore easier to read and to
> understand. It is also easier to reason about its performance.
> For example,
>
> -  the main operation loops over all visits,
> - the non-constant-time part of each iteration consists of lookup and
> update operations
>   on its main data structure, VisitDataMap,
> - VisitDataMap is a map, and is therefore implemented using balanced trees.
>
> This makes it clear that its complexity is O(N log N), where N is the
> number of visits.
> (Technically, it is O(N log M), where M is the number of unique patients,
> but the
> difference is negligible.) By contrast, I cannot tell anything about the
> performance
> of the aggregate-using code above, because all the relevant details are
> hidden
> behind abstraction boundaries, whose documentation is silent about
> performance.
> Note that in the SQL programs from which you draw your inspiration,
> selecting
> the right set of indexes for each relation is usually an important design
> concern.
>
> Zoltan._______________________________________________
> users mailing list
> users at lists.mercurylang.org
> https://lists.mercurylang.org/listinfo/users
> <https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mercurylang.org%2Flistinfo%2Fusers&data=05%7C01%7Cmark.clements%40ki.se%7C585dbab45ec34f7f25ef08daeae32777%7Cbff7eef1cf4b4f32be3da1dda043c05d%7C0%7C0%7C638080559250227740%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=ifg5Y1A2bX92Y5sjcVVwesIHjFXY42k0fKIydxoUXZs%3D&reserved=0>
>
>
>
> *När du skickar e-post till Karolinska Institutet (KI) innebär detta att
> KI kommer att behandla dina personuppgifter. *Här finns information om
> hur KI behandlar personuppgifter
> <https://ki.se/medarbetare/integritetsskyddspolicy>.
>
>
> *Sending email to Karolinska Institutet (KI) will result in KI processing
> your personal data.* You can read more about KI’s processing of personal
> data here <https://ki.se/en/staff/data-protection-policy>.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mercurylang.org/archives/users/attachments/20230307/e1aa5d28/attachment-0001.html>
Previous message: [m-users.] Announcement (aggregates module) + questions (window functions)
Next message: [m-users.] Existential types question
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the users mailing list