[m-rev.] for review: improve parallel execution mechanism

Julien Fischer juliensf at csse.unimelb.edu.au
Tue Sep 4 19:15:37 AEST 2007


On Tue, 4 Sep 2007, Peter Wang wrote:

> Branches: main
>
> Make the parallel conjunction execution mechanism more efficient.
>
> 1. Don't allocate sync terms on the heap.  Sync terms are now allocated in
> the stack frame of the procedure call which originates a parallel
> conjunction.
>
> 2. Don't allocate individual sparks on the heap.  Sparks are now stored in
> preallocated, growing arrays using an algorithm that doesn't use locks.

preallocated, growing arrays?

> 3. Don't have one mutex per sync term.  Just use one mutex to protect
> concurrent accesses to all sync terms (it's is rarely needed anyway).  This
> makes sync terms smaller and saves initialising a mutex for each parallel
> conjunction encountered.
>
> 4. We don't bother to acquire the global sync term lock if we know a parallel
> conjunction couldn't be executing in parallel.  In a highly parallel program,
> the majority of parallel conjunctions will be executed sequentially so
> protecting the sync terms from concurrent accesses is unnecessary.
>
>
> par_fib(39) is ~8.4 times faster (user time) on my laptop (Linux 2.6, x86_64),

Did you try any other benchmarks?


> which is ~3.5 as slow as sequential execution.

How much of that remaining overhead is down to GC?

> configure.in:
> 	Update the configuration for a changed MR_SyncTerm structure.
>
> compiler/llds.m:
> 	Make the fork instruction take a second argument, which is the base
> 	stack slot of the sync term.
>
> 	Rename it to fork_new_child to match the macro name in the runtime.
>
> compiler/par_conj_gen.m:
> 	Change the generated code for parallel conjunctions to allocate sync
> 	terms on the stack and to pass the sync term to fork_new_child.
>
> compiler/dupelim.m:
> compiler/dupproc.m:
> compiler/exprn_aux.m:
> compiler/global_data.m:
> compiler/jumpopt.m:
> compiler/livemap.m:
> compiler/llds_out.m:
> compiler/llds_to_x86_64.m:
> compiler/middle_rec.m:
> compiler/opt_debug.m:
> compiler/opt_util.m:
> compiler/reassign.m:
> compiler/use_local_vars.m:
> 	Conform to the change in the fork instruction.
>
> compiler/liveness.m:
> compiler/proc_gen.m:
> 	Disable use of the parallel conjunction operator in the compiler as
> 	older versions of the compiler will generate code incompatible with
> 	the new runtime.

I take it you plan to add them back after this change has bootchecked?

> runtime/mercury_context.c:
> runtime/mercury_context.h:
> 	Remove the next pointer field from MR_Spark as it's no longer needed.
>
> 	Remove the mutex from MR_SyncTerm.  Add a field to record if a spark
> 	belonging to the sync term was scheduled globally, i.e. if the
> 	parallel conjunction might be executed in parallel.
>
> 	Define MR_SparkDeque and MR_SparkArray.
>
> 	Use MR_SparkDeques to hold per-context sparks and global sparks.
>
> 	Change the abstract machine instructions MR_init_sync_term,
> 	MR_fork_new_child, MR_join_and_continue as per the main change log.
>
> 	Use a preprocessor macro MR_LL_PARALLEL_CONJ as a shorthand for
> 	!MR_HIGHLEVEL_CODE && MR_THREAD_SAFE.
>
> 	Take the opportunity to clean things up a bit.
>
> runtime/mercury_wsdeque.c:
> runtime/mercury_wsdeque.h:
> 	New files containing an implementation of work-stealing deques.  We
> 	don't do work stealing yet but we use the underlying data structure.
>
> runtime/mercury_atomic.h:
> 	New file to contain atomic operations.  Currently it just contains
> 	compare-and-swap for gcc/x86_64, gcc/x86 and gcc-4.1.

I think mercury_atomic_ops.h would be a better name.

...

> Index: ./compiler/par_conj_gen.m
> ===================================================================
> RCS file: /home/mercury/mercury1/repository/mercury/compiler/par_conj_gen.m,v
> retrieving revision 1.35
> diff -u -r1.35 par_conj_gen.m
> --- ./compiler/par_conj_gen.m	7 Aug 2007 07:10:01 -0000	1.35
> +++ ./compiler/par_conj_gen.m	4 Sep 2007 07:30:35 -0000
> @@ -74,11 +74,10 @@
> % finished, causes the code following the parallel conjunction to execute in
> % the context that originated the parallel conjunction.  If the originating
> % context can't execute the next conjunct and the parallel conjunction isn't
> -% finished, it must suspend and store its address in the sync term.  When a
> -% non-originating context later finds that the parallel conjunction _is_
> -% finished, it will then cause the originating context to resume execution
> -% at the join point.  Please see the implementation of MR_join_and_continue()
> -% for the details.
> +% finished, it must suspend.  When a non-originating context later finds that
> +% the parallel conjunction _is_ finished, it will then cause the originating
> +% context to resume execution at the join point.  Please see the
> +% implementation of MR_join_and_continue() for the details.
> %
> % The runtime support for parallel conjunction is documented in the runtime
> % directory in mercury_context.{c,h}.

...

> --- ./compiler/liveness.m	7 Aug 2007 07:09:57 -0000	1.156
> +++ ./compiler/liveness.m	4 Sep 2007 07:30:35 -0000
> @@ -241,7 +241,7 @@
>         list.split_list(1000, PredIds, HeadPredIds, TailPredIds)
>     then
>         ( detect_liveness_preds_parallel_3(HeadPredIds, HLDS0, !HLDS)
> -        & detect_liveness_preds_parallel_2(TailPredIds, HLDS0, !HLDS)
> +        , detect_liveness_preds_parallel_2(TailPredIds, HLDS0, !HLDS)
>         )
>     else
>         detect_liveness_preds_parallel_3(PredIds, HLDS0, !HLDS)
> ===================================================================
> RCS file: /home/mercury/mercury1/repository/mercury/compiler/proc_gen.m,v
> retrieving revision 1.22
> diff -u -r1.22 proc_gen.m
> --- ./compiler/proc_gen.m	13 Aug 2007 03:01:43 -0000	1.22
> +++ ./compiler/proc_gen.m	4 Sep 2007 07:30:35 -0000
> @@ -171,7 +171,7 @@
>         list.map_foldl(generate_pred_code_par(ModuleInfo0),
>             PredIdsA, PredProceduresA, GlobalData0, GlobalDataA),
>         list.condense(PredProceduresA, ProceduresA)
> -    &
> +    ,
>         list.condense(ListsOfPredIdsB, PredIdsB),
>         GlobalData1 = bump_type_num_counter(GlobalData0, type_num_skip),
>         list.map_foldl(generate_pred_code_par(ModuleInfo0),

Add an XXX comment mentioning that this should be a parallel
conjunction.  (And in the module above.)



> Index: ./compiler/llds.m
> ===================================================================
> RCS file: /home/mercury/mercury1/repository/mercury/compiler/llds.m,v
> retrieving revision 1.352
> diff -u -r1.352 llds.m
> --- ./compiler/llds.m	20 Aug 2007 03:35:57 -0000	1.352
> +++ ./compiler/llds.m	4 Sep 2007 07:30:35 -0000
> @@ -546,16 +546,19 @@
>             % duplicated by jump optimization.
>
>     ;       init_sync_term(lval, int)
> -            % Initialize a synchronization term. The first argument contains
> -            % the lvalue into which we will store the synchronization term,
> -            % and the second argument indicates how many branches we expect
> -            % to join at the end of the parallel conjunction. (See the
> -            % documentation in par_conj_gen.m and runtime/mercury_context.{c,h}
> -            % for further information about synchronisation terms.)
> -
> -    ;       fork(label)
> -            % Create a new spark. fork(Child) creates spark, to begin execution
> -            % at Child. Control continues at the next instruction.
> +            % Initialize a synchronization term, which is a continuous number
> +            % of slots on the detstack.  The first argument contains the base
> +            % address of the synchronization term.  The second argument
> +            % indicates how many branches we expect to join at the end of the
> +            % parallel conjunction. (See the documentation in par_conj_gen.m
> +            % and runtime/mercury_context.{c,h} for further information about
> +            % synchronisation terms.)
> +
> +    ;       fork_new_child(lval, label)
> +            % Create a new spark. fork(SyncTerm, Child) creates spark, to begin
> +            % execution at Child, where SyncTerm contains the base address of
> +            % the synchronisation term. Control continues at the next
> +            % instruction.
>
>     ;       join_and_continue(lval, label).
>             % Signal that this thread of execution has finished in the current



> Index: ./runtime/mercury_context.h
> ===================================================================
> RCS file: /home/mercury/mercury1/repository/mercury/runtime/mercury_context.h,v
> retrieving revision 1.43
> diff -u -r1.43 mercury_context.h
> --- ./runtime/mercury_context.h	10 May 2007 05:24:16 -0000	1.43
> +++ ./runtime/mercury_context.h	4 Sep 2007 07:30:35 -0000



> @@ -59,6 +59,14 @@
> #include "mercury_goto.h"       /* for MR_GOTO() */
> #include "mercury_conf.h"       /* for MR_CONSERVATIVE_GC */
>
> +#if !defined(MR_HIGHLEVEL_CODE) && defined(MR_THREAD_SAFE)
> +  /*
> +  ** Whether we are in a grade which supports the low-level parallel
> +  ** conjunction execution mechanism.
> +  */
> +  #define MR_LL_PARALLEL_CONJ 1
> +#endif

mercury_conf_param.h would be a more appropriate place to define that
IMO.

...

> Index: ./runtime/mercury_context.c
> ===================================================================
> RCS file: /home/mercury/mercury1/repository/mercury/runtime/mercury_context.c,v
> retrieving revision 1.56
> diff -u -r1.56 mercury_context.c
> --- ./runtime/mercury_context.c	1 May 2007 01:11:42 -0000	1.56
> +++ ./runtime/mercury_context.c	4 Sep 2007 07:30:35 -0000
> @@ -40,17 +40,24 @@
> /*
> ** The run queue and spark queue are protected and signalled with the
> ** same lock and condition variable.
> +**
> +** The single sync term lock is used to prevent races in MR_join_and_continue.
> +** The holder of the sync term lock may acquire the runqueue lock but not vice
> +** versa.  (We could also have one sync term lock per context, and make
> +** MR_join_and_continue acquire the sync term lock of the context that
> +** originated the parallel conjunction, but contention of the single lock

contention for the single lock

...

> Index: ./runtime/mercury_wsdeque.h
> ===================================================================
> RCS file: ./runtime/mercury_wsdeque.h
> diff -N ./runtime/mercury_wsdeque.h
> --- /dev/null	1 Jan 1970 00:00:00 -0000
> +++ ./runtime/mercury_wsdeque.h	4 Sep 2007 07:30:35 -0000
> @@ -0,0 +1,158 @@
> +/*
> +** vim: ts=4 sw=4 expandtab
> +*/
> +/*
> +** Copyright (C) 2007 The University of Melbourne.
> +** This file may only be copied under the terms of the GNU Library General
> +** Public License - see the file COPYING.LIB in the Mercury distribution.
> +*/
> +
> +#ifndef MERCURY_WSDEQUE_H
> +#define MERCURY_WSDEQUE_H
> +
> +#ifdef MR_LL_PARALLEL_CONJ
> +
> +#include "mercury_atomic.h"
> +
> +/* XXX should experiment with these */
> +#define MR_INITIAL_GLOBAL_SPARK_QUEUE_SIZE  4
> +#define MR_INITIAL_LOCAL_SPARK_DEQUE_SIZE   8
> +
> +/*---------------------------------------------------------------------------*/
> +
> +/* See mercury_context.h for the definition of MR_SparkDeque. */
> +
> +struct MR_SparkArray_Struct {
> +    MR_Integer          MR_sa_max;          /* power of two - 1 */
> +    volatile MR_Spark   MR_sa_segment[1];   /* really MR_sa_max + 1 */
> +};
> +
> +/*
> +** MR_sa_element(Array, Pos)
> +** Index into Array modulo its size, i.e. treating it as a circular array.
> +**
> +** MR_sa_max is a power of two - 1 so that we can use a bitwise AND operation
> +** operation instead of modulo when indexing into the array, which makes a
> +** significant difference.
> +*/
> +#define MR_sa_element(arr, pos)     (arr->MR_sa_segment[pos & arr->MR_sa_max])

Place parentheses around `pos' and `arr' in the body of that macro.

...

> Index: ./runtime/mercury_atomic.h
> ===================================================================
> RCS file: ./runtime/mercury_atomic.h
> diff -N ./runtime/mercury_atomic.h
> --- /dev/null	1 Jan 1970 00:00:00 -0000
> +++ ./runtime/mercury_atomic.h	4 Sep 2007 07:30:35 -0000
> @@ -0,0 +1,80 @@
> +/*
> +** vim:ts=4 sw=4 expandtab
> +*/
> +/*
> +** Copyright (C) 2007 The University of Melbourne.
> +** This file may only be copied under the terms of the GNU Library General
> +** Public License - see the file COPYING.LIB in the Mercury distribution.
> +*/
> +
> +/*
> +** mercury_atomic.h - defines atomic operations.
> +*/
> +
> +#ifndef MERCURY_ATOMIC_H
> +#define MERCURY_ATOMIC_H
> +
> +/*
> +** If the value at addr is equal to old, assign new to addr and return true.
> +** Otherwise return false.
> +*/
> +MR_EXTERN_INLINE MR_bool
> +MR_compare_and_swap_word(volatile MR_Integer *addr, MR_Integer old,
> +        MR_Integer new_val) ;
> +
> +/*---------------------------------------------------------------------------*/
> +
> +#if defined(__GNUC__) && defined(__x86_64__)
> +
> +    MR_EXTERN_INLINE MR_bool
> +    MR_compare_and_swap_word(volatile MR_Integer *addr, MR_Integer old,
> +            MR_Integer new_val)
> +    {
> +        char result;
> +
> +        __asm__ __volatile__(
> +            "lock; cmpxchgq %3, %0; setz %1"
> +            : "=m"(*addr), "=q"(result)
> +            : "m"(*addr), "r" (new_val), "a"(old)
> +            : "memory"
> +        );
> +        return (int) result;
> +    }
> +
> +#elif defined(__GNUC__) && defined(__i386__)
> +
> +    /* Really 486 or better. */
> +    MR_EXTERN_INLINE MR_bool
> +    MR_compare_and_swap_word(volatile MR_Integer *addr, MR_Integer old,
> +            MR_Integer new_val)
> +    {
> +        char result;
> +
> +        __asm__ __volatile__(
> +            "lock; cmpxchgl %3, %0; setz %1"
> +            : "=m"(*addr), "=q"(result)
> +            : "m"(*addr), "r" (new_val), "a"(old)
> +            : "memory");
> +        return (int) result;
> +    }
> +
> +#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
> +
> +    /* gcc 4.1 and above have builtin atomic operations. */
> +
> +    MR_EXTERN_INLINE MR_bool
> +    MR_compare_and_swap_word(volatile MR_Integer *addr, MR_Integer old,
> +            MR_Integer new_val)
> +    {
> +        return __sync_bool_compare_and_swap(addr, old, new_val);
> +    }

Is there a reason to prefer the inline assembler version on x86(_64)s
if gcc 4.1 is available?

...

> +/*
> +** If we don't have definitions available for this compiler or architecture
> +** then we will get a link error in low-level .par grades.

So this kills of the lowlevel .par grades on powerpc for now?

> No other grades
> +** currently require on atomic ops.

Delete "on".

That looks okay otherwise, but please bootcheck in few grades in order
to make sure that you've got all the #ifdefs in the right spots
(hlc.par.gc is one such grade.)

Julien.
--------------------------------------------------------------------------
mercury-reviews mailing list
Post messages to:       mercury-reviews at csse.unimelb.edu.au
Administrative Queries: owner-mercury-reviews at csse.unimelb.edu.au
Subscriptions:          mercury-reviews-request at csse.unimelb.edu.au
--------------------------------------------------------------------------



More information about the reviews mailing list