[m-rev.] for review: improve parallel execution mechanism

Peter Wang novalazy at gmail.com
Thu Oct 11 21:51:24 AEST 2007


On 2007-09-04, Julien Fischer <juliensf at csse.unimelb.edu.au> wrote:
> 
>  On Tue, 4 Sep 2007, Peter Wang wrote:
> 
> > Branches: main
> >
> > Make the parallel conjunction execution mechanism more efficient.
> >
> > 1. Don't allocate sync terms on the heap.  Sync terms are now allocated in
> > the stack frame of the procedure call which originates a parallel
> > conjunction.
> >
> > 2. Don't allocate individual sparks on the heap.  Sparks are now stored in
> > preallocated, growing arrays using an algorithm that doesn't use locks.
> 
>  preallocated, growing arrays?

i.e. allocated before they are strictly necessary but growing
dynamically as required.
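To make that concrete, here is a minimal sketch of grow-on-demand spark
storage (hypothetical names; the real runtime additionally uses a
lock-free deque algorithm for concurrent access, which this sketch
omits):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical spark type standing in for the runtime's MR_Spark. */
typedef struct {
    void    *resume_point;
    void    *sync_term;
} Spark;

/* A spark array whose backing store is allocated up front and doubled
** when it fills, so individual sparks never need heap allocation.
*/
typedef struct {
    Spark   *data;
    size_t  size;
    size_t  capacity;
} SparkArray;

void
spark_array_init(SparkArray *arr, size_t initial_capacity)
{
    arr->data = malloc(initial_capacity * sizeof(Spark));
    arr->size = 0;
    arr->capacity = initial_capacity;
}

void
spark_array_push(SparkArray *arr, Spark spark)
{
    if (arr->size == arr->capacity) {
        /* Grow by doubling; the old contents are copied across. */
        arr->capacity *= 2;
        arr->data = realloc(arr->data, arr->capacity * sizeof(Spark));
    }
    arr->data[arr->size++] = spark;
}
```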

> 
> > 3. Don't have one mutex per sync term.  Just use one mutex to protect
> > concurrent accesses to all sync terms (it is rarely needed anyway).  This
> > makes sync terms smaller and saves initialising a mutex for each parallel
> > conjunction encountered.
> >
> > 4. We don't bother to acquire the global sync term lock if we know a 
> > parallel
> > conjunction couldn't be executing in parallel.  In a highly parallel 
> > program,
> > the majority of parallel conjunctions will be executed sequentially so
> > protecting the sync terms from concurrent accesses is unnecessary.
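A rough sketch of points 3 and 4 together (hypothetical names and
fields, not the actual runtime structures): one global lock guards all
sync terms, and even that is skipped when the conjunction is known not
to be executing in parallel:

```c
#include <assert.h>

/* Hypothetical stand-in for the runtime's sync term. */
typedef struct {
    int     outstanding_conjuncts;  /* conjuncts yet to complete */
    int     is_shared;              /* might other engines touch this? */
} SyncTerm;

/* A tiny gcc-builtin spinlock stands in for the runtime's single
** global sync term lock, keeping the sketch self-contained.
*/
static volatile int global_sync_term_lock = 0;

static void
lock_sync_terms(void)
{
    while (__sync_lock_test_and_set(&global_sync_term_lock, 1)) {
        /* spin until the lock is released */
    }
}

static void
unlock_sync_terms(void)
{
    __sync_lock_release(&global_sync_term_lock);
}

/* Decrement the count of outstanding conjuncts, taking the global lock
** only when the sync term may be accessed from more than one engine.
*/
int
finish_conjunct(SyncTerm *st)
{
    int remaining;

    if (st->is_shared) {
        lock_sync_terms();
        remaining = --st->outstanding_conjuncts;
        unlock_sync_terms();
    } else {
        /* Sequential execution: no other engine can see this term. */
        remaining = --st->outstanding_conjuncts;
    }
    return remaining;
}
```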
> >
> >
> > par_fib(39) is ~8.4 times faster (user time) on my laptop (Linux 2.6, 
> > x86_64),
> 
>  Did you try any other benchmarks?

For a coarse-grained benchmark like icfp2000 there is no difference, as
expected.

hanoiapp(22) is ~10% faster than old parallel execution (user time,
using two engines).

mmatrix (really multiplication of lists of lists of integers) is
strange.  The old parallel execution is actually slightly faster for
larger matrices (500x500) in terms of real time, even though in terms
of user time the new parallel execution mechanism is slightly more
efficient.  The program is bound by memory latency, so I don't put too
much stock in the results.

> > which is ~3.5 times as slow as sequential execution.
> 
>  How much of that remaining overhead is down to GC?

Virtually none.  With a sensible number of engines used, not enough
memory is allocated for deques and contexts to cause the GC to kick in.

> > compiler/liveness.m:
> > compiler/proc_gen.m:
> > 	Disable use of the parallel conjunction operator in the compiler as
> > 	older versions of the compiler will generate code incompatible with
> > 	the new runtime.
> 
>  I take it you plan to add them back after this change has bootchecked?

In a week or two.

> > runtime/mercury_atomic.h:
> > 	New file to contain atomic operations.  Currently it just contains
> > 	compare-and-swap for gcc/x86_64, gcc/x86 and gcc-4.1.
> 
>  I think mercury_atomic_ops.h would be a better name.

Renamed.  I had to add mercury_atomic_ops.c with out-of-line definitions
for when C optimisations are disabled.
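For reference, the out-of-line fallback follows the usual C idiom for
headers that define inline functions; a sketch with hypothetical names
(the actual macros in mercury_atomic_ops.h may differ):

```c
#include <assert.h>

/* Sketch of the out-of-line fallback idiom (hypothetical names; the
** real macros live in mercury_atomic_ops.h and .c).  With C
** optimisation enabled, every translation unit that includes the
** header gets an inline definition.  With optimisation disabled, the
** same body is compiled once as a normal external function in the .c
** file, so callers still link successfully.  Both cases are folded
** into one file here for demonstration.
*/
#ifdef MY_OPTIMISED_BUILD
  #define MY_EXTERN_INLINE static inline
#else
  #define MY_EXTERN_INLINE   /* plain external definition */
#endif

MY_EXTERN_INLINE int
my_compare_and_swap(volatile long *addr, long old, long new_val)
{
    /* gcc 4.1+ builtin; returns nonzero iff the swap happened. */
    return __sync_bool_compare_and_swap(addr, old, new_val);
}
```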

> > --- ./compiler/liveness.m	7 Aug 2007 07:09:57 -0000	1.156
> > +++ ./compiler/liveness.m	4 Sep 2007 07:30:35 -0000
> > @@ -241,7 +241,7 @@
> >         list.split_list(1000, PredIds, HeadPredIds, TailPredIds)
> >     then
> >         ( detect_liveness_preds_parallel_3(HeadPredIds, HLDS0, !HLDS)
> > -        & detect_liveness_preds_parallel_2(TailPredIds, HLDS0, !HLDS)
> > +        , detect_liveness_preds_parallel_2(TailPredIds, HLDS0, !HLDS)
> >         )
> >     else
> >         detect_liveness_preds_parallel_3(PredIds, HLDS0, !HLDS)
> > ===================================================================
> > RCS file: /home/mercury/mercury1/repository/mercury/compiler/proc_gen.m,v
> > retrieving revision 1.22
> > diff -u -r1.22 proc_gen.m
> > --- ./compiler/proc_gen.m	13 Aug 2007 03:01:43 -0000	1.22
> > +++ ./compiler/proc_gen.m	4 Sep 2007 07:30:35 -0000
> > @@ -171,7 +171,7 @@
> >         list.map_foldl(generate_pred_code_par(ModuleInfo0),
> >             PredIdsA, PredProceduresA, GlobalData0, GlobalDataA),
> >         list.condense(PredProceduresA, ProceduresA)
> > -    &
> > +    ,
> >         list.condense(ListsOfPredIdsB, PredIdsB),
> >         GlobalData1 = bump_type_num_counter(GlobalData0, type_num_skip),
> >         list.map_foldl(generate_pred_code_par(ModuleInfo0),
> 
>  Add an XXX comment mentioning that this should be a parallel
>  conjunction.  (And in the module above.)

Done.

> > Index: ./runtime/mercury_context.h
> > ===================================================================
> > RCS file: 
> > /home/mercury/mercury1/repository/mercury/runtime/mercury_context.h,v
> > retrieving revision 1.43
> > diff -u -r1.43 mercury_context.h
> > --- ./runtime/mercury_context.h	10 May 2007 05:24:16 -0000	1.43
> > +++ ./runtime/mercury_context.h	4 Sep 2007 07:30:35 -0000
> 
> 
> 
> > @@ -59,6 +59,14 @@
> > #include "mercury_goto.h"       /* for MR_GOTO() */
> > #include "mercury_conf.h"       /* for MR_CONSERVATIVE_GC */
> >
> > +#if !defined(MR_HIGHLEVEL_CODE) && defined(MR_THREAD_SAFE)
> > +  /*
> > +  ** Whether we are in a grade which supports the low-level parallel
> > +  ** conjunction execution mechanism.
> > +  */
> > +  #define MR_LL_PARALLEL_CONJ 1
> > +#endif
> 
>  mercury_conf_param.h would be a more appropriate place to define that
>  IMO.

Yes, moved.

> > Index: ./runtime/mercury_atomic.h
> > ===================================================================
> > RCS file: ./runtime/mercury_atomic.h
> > diff -N ./runtime/mercury_atomic.h
> > --- /dev/null	1 Jan 1970 00:00:00 -0000
> > +++ ./runtime/mercury_atomic.h	4 Sep 2007 07:30:35 -0000
> > @@ -0,0 +1,80 @@
...
> > +
> > +    /* gcc 4.1 and above have builtin atomic operations. */
> > +
> > +    MR_EXTERN_INLINE MR_bool
> > +    MR_compare_and_swap_word(volatile MR_Integer *addr, MR_Integer old,
> > +            MR_Integer new_val)
> > +    {
> > +        return __sync_bool_compare_and_swap(addr, old, new_val);
> > +    }
> 
>  Is there a reason to prefer the inline assembler version on x86(_64)s
>  if gcc 4.1 is available?
> 

Probably not, but I haven't tested with gcc 4.1 much.
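For comparison, the hand-written x86_64 version would look roughly like
this (a sketch only, not the actual mercury_atomic_ops.h code; on other
targets it falls back to the gcc builtin shown in the patch):

```c
#include <assert.h>

/* Compare-and-swap: returns nonzero iff *addr held `old` and has been
** replaced by new_val.
*/
static int
my_cas_word(volatile long *addr, long old, long new_val)
{
#if defined(__GNUC__) && defined(__x86_64__)
    long prev;
    /* lock cmpxchg compares %rax (old) with *addr; on a match it
    ** stores new_val and leaves old in prev, otherwise it loads the
    ** current value of *addr into prev.
    */
    __asm__ __volatile__(
        "lock; cmpxchgq %2, %1"
        : "=a" (prev), "+m" (*addr)
        : "r" (new_val), "0" (old)
        : "memory");
    return prev == old;
#else
    /* gcc 4.1+ builtin, as in the patch. */
    return __sync_bool_compare_and_swap(addr, old, new_val);
#endif
}
```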

> > +/*
> > +** If we don't have definitions available for this compiler or 
> > architecture
> > +** then we will get a link error in low-level .par grades.
> 
>  So this kills off the low-level .par grades on powerpc for now?

Unless you use gcc 4.1+.  There's a ppc definition of CAS in libatomic_ops.

>  That looks okay otherwise, but please bootcheck in a few grades in order
>  to make sure that you've got all the #ifdefs in the right spots
>  (hlc.par.gc is one such grade).

Thanks for the review.  I've committed it, finally.

Peter

--------------------------------------------------------------------------
mercury-reviews mailing list
Post messages to:       mercury-reviews at csse.unimelb.edu.au
Administrative Queries: owner-mercury-reviews at csse.unimelb.edu.au
Subscriptions:          mercury-reviews-request at csse.unimelb.edu.au
--------------------------------------------------------------------------


