[m-rev.] for review: Runtime granularity control changes.

Paul Bone pbone at csse.unimelb.edu.au
Fri Jun 12 14:03:21 AEST 2009


On Fri, Jun 12, 2009 at 11:26:16AM +1000, Peter Wang wrote:
> 2009/6/12 Paul Bone <pbone at csse.unimelb.edu.au>:
> >
> > I've fixed the problems with this patch, some code within C macros had typos
> > that wern't detected until the macros where expanded by test cases that used
> > the parallelism feastures.
> >
> > For review by Zoltan,  Peter Wang, you may like to look at the atomic
> > operations I'm using.
> 
> Paul, I haven't looked at this yet.
> 
> You may remember I posted a patch that implements work stealing.  I still
> haven't committed that (forgot about it), but if I did then the runtime
> granularity control, as it exists now, would have to be discarded.
> 
> The reason is this.  Currently there is a single global spark queue, and you
> try to guess, ahead of time, by whatever means, whether it's likely that a
> spark will be taken up to be executed in parallel.  So the granularity
> control macros are used to help you guess.
> 
> Under work stealing, there are multiple spark dequeues.  When you create a
> spark, you *always* put it on a local queue, no guess required, as it is, by
> design, a cheap operation.  When an engine becomes idle, it can steal a spark
> from any one of those dequeues.  The nice part is that there is no trying to
> see into the future, making a mistake and losing the potential parallelism.
> The bad part is that an idle engine doesn't know exactly which deques are
> non-empty or if any of them are non-empty, which can result in wasted effort.
> The hand wavy argument is that it's not a problem if your program is indeed
> highly parallel, i.e. #sparks >> #engines
> 
> I think it's up to you which way to proceed, or both if you can manage the
> complexity in the runtime.  It would be best to do an evaluation on a real
> parallel program, if there is one.
> 

I've been meaning to discuss work stealing and other approaches with
yourself and Zoltan.

Currently I'm aware that a macro is used to determine if there is a
potentially free engine that can handle work if placed on the global
queue.  If true, work is scheduled globally, otherwise it's scheduled
locally.

What I've been working on is different.  For recursive procedures that
make two recursive calls such as quicksort or fibs it can be useful to
parallelize the (mostly-)independent recursive calls at the first few
levels of the call tree.  Later in the call tree this is not an
optimisation and a lot of parallel work may already be running.  One
solution is to turn the parallel conjunction containing the two
recursive calls into a "conditional parallel conjunction" which has the
following semantics:

    ( should_run_in_parallel ->
        (
            CallA &
            CallB
        )
    ->
        CallA,
        CallB
    )

There are other, arguably better solutions, such as always parallelizing
work at the top of the tree, conditionally parallelizing it in the
middle.  And running work sequentially towards the leaves of the tree.

The parallel conjunct containing CallB may be scheduled either locally
or globally, both because the test for how to schedule the work and the
should_run_in_parallel goal above need not be the same, and the
variables used for test tests may have changed after one decision has
been made but before the other one has.

Both granularity control methods aim to solve similar but different
problems.  The conditional parallel conjunction can be introduced in
code by the compiler when it may or may not be best to parallelize the
work.  It's strength is that it can be used in some parts of a program
but not others.  Whereas the local spark queues attempt to reduce the
costs of managing parallel work, mostly the synchronisation involved
with managing a global spark queue.

The main thing that concerns me with the local spark queues and local
scheduling is that one engine may decide to evaluate something locally,
continue it's current work.  In the mean time other engines finish their
work and check the global spark queue, once the global queue is empty an
engine must become idle, since it cannot yet steal work from a local
queue.

I can see that work stealing may improve this.  And perhaps there are
methods we can use that can allow an idle engine to quickly determine
which other engine to steal work from.

I'm also concerned that evaluating sparks in any other order but the
order they are scheduled is sub-optimal.  You've suggested that work
should be stolen from the 'cold end' of a queue.  Do you mean the end
where items are 'enqueued'?  If the queue's size is constant over (a
large enough) time both ends will be equally busy.

I think in general sparks should be executed in the order they are
scheduled.  It may also be an optimisation to schedule sparks based on
four queues (or similar).

   1) Sparks that will signal futures and will not wait on any futures.
   2) Independent sparks
   3) Sparks that both signal and wait on futures.
   4) Sparks that wait on futures but do not signal futures.

When looking for parallel work an engine should attempt to get work from
queue 1 first followed by queue 2 and so on.

Given all this I'm really not sure what we'll end up with. :-)  I'll be
interested to read your comments.

Thanks.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 489 bytes
Desc: Digital signature
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20090612/40590733/attachment.sig>


More information about the reviews mailing list