[m-rev.] for review: Runtime granularity control changes.

Paul Bone pbone at csse.unimelb.edu.au
Mon Jul 6 14:29:43 AEST 2009


On Fri, Jun 12, 2009 at 02:03:21PM +1000, Paul Bone wrote:
> On Fri, Jun 12, 2009 at 11:26:16AM +1000, Peter Wang wrote:
> > 
> > You may remember I posted a patch that implements work stealing.  I still
> > haven't committed that (forgot about it), but if I did then the runtime
> > granularity control, as it exists now, would have to be discarded.
> > 
> > The reason is this.  Currently there is a single global spark queue, and you
> > try to guess, ahead of time, by whatever means, whether it's likely that a
> > spark will be taken up to be executed in parallel.  So the granularity
> > control macros are used to help you guess.
> > 
> > Under work stealing, there are multiple spark dequeues.  When you create a
> > spark, you *always* put it on a local queue, no guess required, as it is, by
> > design, a cheap operation.  When an engine becomes idle, it can steal a spark
> > from any one of those dequeues.  The nice part is that there is no trying to
> > see into the future, making a mistake and losing the potential parallelism.
> > The bad part is that an idle engine doesn't know exactly which deques are
> > non-empty or if any of them are non-empty, which can result in wasted effort.
> > The hand wavy argument is that it's not a problem if your program is indeed
> > highly parallel, i.e. #sparks >> #engines
> > 
> > I think it's up to you which way to proceed, or both if you can manage the
> > complexity in the runtime.  It would be best to do an evaluation on a real
> > parallel program, if there is one.
> > 
> 
> I've been meaning to discuss work stealing and other approaches with
> yourself and Zoltan.
>
> *snip*
>
> I can see that work stealing may improve this.  And perhaps there are
> methods we can use that can allow an idle engine to quickly determine
> which other engine to steal work from.
> 
> I'm also concerned that evaluating sparks in any other order but the
> order they are scheduled is sub-optimal.  You've suggested that work
> should be stolen from the 'cold end' of a queue.  Do you mean the end
> where items are 'enqueued'?  If the queue's size is constant over (a
> large enough) time both ends will be equally busy.
> 
> I think in general sparks should be executed in the order they are
> scheduled.  It may also be an optimisation to schedule sparks based on
> four queues (or similar).
> 
>    1) Sparks that will signal futures and will not wait on any futures.
>    2) Independent sparks
>    3) Sparks that both signal and wait on futures.
>    4) Sparks that wait on futures but do not signal futures.
> 
> When looking for parallel work an engine should attempt to get work from
> queue 1 first followed by queue 2 and so on.
> 
> Given all this I'm really not sure what we'll end up with. :-)  I'll be
> interested to read your comments.

I've re-read parts of your paper (Section 4.5) and I have a better
understanding of your intentions for work stealing.  In particular, I
understand why you intend to use deques, which are used as stacks
locally and as queues remotely.
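
To check my understanding, here is a rough sketch (in C) of how I
picture a per-engine spark deque working.  The names (spark,
spark_deque, push_spark, pop_spark, steal_spark) are ones I've made up
for illustration, not the runtime's actual types or functions; a real
implementation would presumably use a lock-free deque rather than a
mutex, and overflow handling is omitted to keep the sketch short.

/*
 * Hypothetical per-engine spark deque: the owner pushes and pops at
 * the hot end (stack-like), idle engines steal from the cold end
 * (queue-like).  Illustrative only, not the Mercury runtime's API.
 */
#include <pthread.h>
#include <stdbool.h>

#define DEQUE_SIZE 1024

typedef struct {
    void (*run)(void *);    /* code for the right conjunct(s)        */
    void *env;              /* environment captured for that code    */
} spark;

typedef struct {
    spark           sparks[DEQUE_SIZE];
    int             cold;   /* index of the oldest spark (steal end) */
    int             hot;    /* index one past the newest spark       */
    pthread_mutex_t lock;
} spark_deque;

static void init_deque(spark_deque *d)
{
    d->cold = 0;
    d->hot = 0;
    pthread_mutex_init(&d->lock, NULL);
}

/* Owner: push a new spark at the hot end (cheap, always local). */
static void push_spark(spark_deque *d, spark s)
{
    pthread_mutex_lock(&d->lock);
    d->sparks[d->hot % DEQUE_SIZE] = s;
    d->hot++;
    pthread_mutex_unlock(&d->lock);
}

/* Owner: pop the most recently pushed spark (LIFO, stack-like), so
 * the spark for B runs immediately after A if it wasn't stolen. */
static bool pop_spark(spark_deque *d, spark *out)
{
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->hot > d->cold) {
        d->hot--;
        *out = d->sparks[d->hot % DEQUE_SIZE];
        ok = true;
    }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Thief (an idle engine): steal the oldest spark from the cold end
 * (FIFO, queue-like). */
static bool steal_spark(spark_deque *d, spark *out)
{
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->hot > d->cold) {
        *out = d->sparks[d->cold % DEQUE_SIZE];
        d->cold++;
        ok = true;
    }
    pthread_mutex_unlock(&d->lock);
    return ok;
}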

When a dependent parallel conjunction (A & B) is executed, a spark
representing B is pushed onto the local deque while A is executed.  If
the spark is not stolen, B will be executed _immediately_ after A,
since it is popped off the 'stack' end of the deque.

My proposal above was motivated by the concern that, by using the
deques as stacks sometimes and as FIFOs at other times, the execution
of goals would be re-ordered and a consumer might be executed before
its producer, making it more likely to block, consuming time and
resources in the context switch (and the creation of a new context).
However, this cannot happen: either B will be stolen and executed in
parallel, or it will be executed sequentially after A (where it does
not block).

This is also true for larger conjunctions such as (A & B & C), since a
single spark represents the computations of both B and C.

If a goal such as ( (A & B) & C ) exists and C depends on a value
produced by B, then C is placed on the stack first, followed by B.
Once A finishes executing, B will be executed followed by C, provided
no work has been stolen.  If work is stolen, C will be stolen first and
may block if B takes a while to produce its value.  This is probably
not a common pattern though, and even if B were stolen before C, C
might still have blocked.
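
To convince myself of that ordering, here is a tiny standalone trace of
the ( (A & B) & C ) example, again with made-up names and a bare array
standing in for the deque; it just shows that the owning engine pops B
before C, while a thief would steal C first.

/*
 * Standalone trace of the spark order for ( (A & B) & C ), assuming
 * the simplified deque model sketched above.  Names are hypothetical.
 */
#include <stdio.h>

int main(void)
{
    /* The deque: cold end at index 0, hot end grows upward. */
    const char *deque[2];
    int cold = 0, hot = 0;

    /* Executing ( (A & B) & C ): the spark for C is pushed first,
     * then the spark for B, then A runs on the owning engine. */
    deque[hot++] = "C";
    deque[hot++] = "B";

    /* Owner, after A finishes: pop from the hot end (LIFO). */
    printf("owner pops:   %s then %s\n", deque[hot - 1], deque[hot - 2]);

    /* An idle engine: steal from the cold end (FIFO). */
    printf("thief steals: %s first\n", deque[cold]);

    return 0;
}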

I don't think my multiple-queue idea above is valuable; deques seem
effective enough.
