[m-rev.] for review: parallel execution mechanism (1/2)

Julien Fischer juliensf at csse.unimelb.edu.au
Wed Sep 20 15:06:03 AEST 2006


On Tue, 12 Sep 2006, Peter Wang wrote:

> Estimated hours taken: 100
> Branches: main
>
> This patch changes the parallel execution mechanism in the low level backend.
> The main idea is that, even in programs with only moderate parallelism, we
> won't have enough processors to exploit it all.  We should try to reduce the
> cost in the common case, i.e. when a parallel conjunction gets executed
> sequentially.  This patch does two things along those lines:
>
> (1) Instead of unconditionally executing all parallel conjuncts (but the last)
> in separate Mercury contexts, we allow a context to continue execution of the
> next conjunct of a parallel conjunction if it has just finished executing the
> previous conjunct.  This saves on allocating unnecessary contexts, which can
> be a big reduction in memory usage.
>
> Notice we also try to execute conjuncts left-to-right so as to minimise the
> need to suspend contexts when there are dependencies between conjuncts.

Delete "Notice".

> (2) Conjuncts that *are* executed in parallel still need separate contexts.
> We used to pass variable bindings to those conjuncts by flushing input
> variable values to stack slots and copying the procedure's stack frame to the
> new context.  When the conjunct finished, we would copy new variable bindings
> back to stack slots in the original context.
>
> What happens now is that we don't do any copying back and forth.  We introduce
> a new abstract machine register `parent_sp' which points to the location of
> the stack pointer at the time that a parallel conjunction began.  In parallel
> conjuncts we refer to all stack slots via the `parent_sp' pointer, since we
> could be running on a different context altogether and `sp' would be pointing
> into a new detstack.  Since parallel conjuncts now share the procedure's stack
> frame, we have to allocate stack slots such that all parallel conjuncts in a
> procedure that could be executing simultaneously have distinct sets of stack
> slots.  We currently use the simplest possible strategy, i.e. don't allow
> variables in parallel conjuncts to reuse stack slots.
>
> Note: in effect parent_sp is a frame pointer which is only set for and used by
> the code of parallel conjuncts.  We don't call it a frame pointer as it can be
> confused with "frame variables" have to do with the nondet stack.

That last sentence doesn't parse.

> compiler/code_info.m:
> 	Add functionality to keep track of how deep inside of nested parallel
> 	conjunctions the code generator is.
>
> 	Add functionality to acquire and release "persistent" temporary stack
> 	slots.  Unlike normal temporary stack slots, these don't get implicitly
> 	released when the code generator's location-dependent state is reset.
>
> 	Conform to additions of `parent_sp' and parent stack variables.
>
> compiler/exprn_aux.m:
> 	Generalise the `substitute_lval_in_*' predicates by
> 	`transform_lval_in_*' predicates.  Instead of performing a fixed
> 	substitution, these take a higher order predicate which performs some
> 	operation on each lval.  Redefine the substitution predicates in terms
> 	of the transformation predicates.
>
> 	Conform to changes in `fork', `join_and_terminate' and
> 	`join_and_continue' instructions.
>
> 	Conform to additions of `parent_sp' and parent stack variables.
>
> 	Remove `substitute_rval_in_args' and `substitute_rval_in_arg' which
> 	were unused.
>
> compiler/live_vars.m:
> 	Introduce a new type `parallel_stackvars' which is threaded through
> 	`build_live_sets_in_goal'.  We accumulate the sets of variables which
> 	are assigned stack slots in each parallel conjunct.  At the end of
> 	processing a parallel conjunction, use this information to force
> 	variables which are assigned stack slots to have distinct slots.
>
> compiler/llds.m:
> 	Change the semantics of the `fork' instruction.  It now takes a single
> 	argument: the label of the next conjunct after the current one.  The
> 	instruction now "sparks" the next conjunct to be run, either in a
> 	different context (possibly in parallel, on another Mercury engine) or
> 	is queued to be executed in the current context after the current
> 	conjunct is finished.
>
> 	Change the semantics of the `join_and_continue' instruction.  This
> 	instruction now serves to end all parallel conjuncts, not just the
> 	last one in a parallel conjunction.
>
> 	Remove the `join_and_terminate' instruction (no longer used).
>
> 	Add the new abstract machine register `parent_sp'.
>
> 	Introduce "parent stack slots", which are the same as normal stack
> 	slots but relative to the `parent_sp' register.

s/the same as/similar to/

> compiler/par_conj_gen.m:
> 	Change the code generated for parallel conjunctions.  That is:
>
> 	- use the new `fork' instruction at the beginning of a parallel
> 	  conjunct;
>
> 	- use the `join_and_continue' instruction at the end of all parallel
> 	  conjuncts;
>
> 	- keep track of how deep the code generator currently is in parallel
> 	  conjunctions;
>
> 	- set and restore the `parent_sp' register when entering a non-nested
> 	  parallel conjunction;
>
> 	- after generating the code of a parallel conjunct, replace all
> 	  references to stack slots by parent stack slots;
>
> 	- remove code to copy back output variables when a parallel conjunct
> 	  finishes.
>
> 	Update some comments.
>
> runtime/mercury_context.c:
> runtime/mercury_context.h:
> 	Add the type `MR_Spark'.  Sparks are allocated on the heap and contain
> 	enough information to begin execution of a single parallel conjunct.
>
> 	Add globals `MR_spark_queue_head' and `MR_spark_queue_tail'.  These
> 	are pointers to the start and end of a global queue of sparks.  Idle
> 	engines can pick up work from this queue in the same way that they can
> 	pick up work from the global context queue (the "run queue").
>
> 	Add new fields to the MR_Context structure.  `MR_ctxt_parent_sp' is a
> 	saved copy of the `parent_sp' register for when the context is
> 	suspended.  `MR_ctxt_spark_stack' is a stack of sparks that we decided
> 	not to put on the global spark queue.
>
> 	Update `MR_load_context' and `MR_save_context' to save and restore
> 	`MR_ctxt_parent_sp'.
>
> 	Add the counters `MR_num_idle_engines' and
> 	`MR_num_outstanding_contexts_and_sparks'.  These are used to decide,
> 	when a `fork' instruction is reached, whether a spark should be put on
> 	the global spark queue (with potential for parallelism but also more
> 	overhead) or on the calling context's spark stack (no parallelism and
> 	less overhead).
>
> 	Rename `MR_init_context' to `MR_init_context_maybe_generator'.  When
> 	initialising contexts, don't reset redzones of already allocated
> 	stacks.  It seems to be unnecessary (and the reset implementation is
> 	buggy anyway, though it's fine on Linux).
>
> 	Rename `MR_schedule' to `MR_schedule_context'.  Add new functions
> 	`MR_schedule_spark_globally' and `MR_schedule_spark_locally'.
>
> 	In `MR_do_runnext', add code for idle engines to get work from the
> 	global spark queue.  Resuming contexts are prioritised over sparks.
>
> 	Rename `MR_fork_new_context' to `MR_fork_new_child'.  Change the
> 	definitions of `MR_fork_new_child' and `MR_join_and_continue' as per
> 	the new behaviour of the `fork' and `join_and_continue' instructions.
> 	Delete `MR_join_and_terminate'.
>
> 	Add a new field `MR_st_orig_context' to the MR_SyncTerm structure to
> 	record which context originated the parallel conjunction instance
> 	represented by a MR_SyncTerm instance, and update `MR_init_sync_term'.
> 	This is needed by the new behaviour of `MR_join_and_continue'.
>
> 	Update some comments.
>
> runtime/mercury_engine.h:
> runtime/mercury_regs.c:
> runtime/mercury_regs.h:
> runtime/mercury_stacks.h:
> 	Add the abstract machine register `parent_sp' and code to copy it to
> 	and from the fake_reg array.
>
> 	Add a macro `MR_parent_sv' to access stack slots via `parent_sp'.
>
> 	Add `MR_eng_parent_sp' to the MercuryEngine structure.
>
> runtime/mercury_wrapper.c:
> runtime/mercury_wrapper.h:
> 	Add Mercury runtime option `--max-contexts-per-thread' which is saved
> 	in the global variable `MR_max_contexts_per_thread'.  The number
> 	`MR_max_outstanding_contexts' is derived from this.  It sets a soft
> 	limit on the number of sparks we put in the global spark queue,
> 	relative to the number of threads we are running.  We don't want to
> 	put too many sparks on the global queue if there are plenty of ready
> 	contexts or sparks already on the global queues, as they are likely to
> 	result in new contexts being allocated.
>
> 	When initially creating worker engines, wait until all the worker
> 	engines have acknowledged that they are idle before continuing.  This
> 	is mainly so programs (especially benchmarks and test cases) with only
> 	a few fork instructions near the beginning of the program don't
> 	execute the forks before any worker engines are ready, resulting in no
> 	parallelism.
>
> runtime/mercury_engine.c:
> runtime/mercury_thread.c:
> 	Don't allocate a context at the time a Mercury engine is created.  An
> 	engine only needs a new context when it is about to pick up a spark.
>
> configure.in:
> compiler/options.m:
> scripts/Mercury.config.in:
> 	Update to reflect the extra field in MR_SyncTerm.
>
> 	Add the option `--sync-term-size' and actually make use the result of
> 	the sync term size calculated during configuration.
>
> compiler/code_util.m:
> compiler/continuation_info.m:
> compiler/dupelim.m:
> compiler/dupproc.m:
> compiler/global_data.m:
> compiler/hlds_llds.m:
> compiler/jumpopt.m:
> compiler/livemap.m:
> compiler/llds_out.m:
> compiler/middle_rec.m:
> compiler/opt_debug.m:
> compiler/opt_util.m:
> compiler/reassign.m:
> compiler/stack_layout.m:
> compiler/use_local_vars.m:
> compiler/var_locn.m:
> 	Conform to changes in `fork', `join_and_terminate' and
> 	`join_and_continue' instructions.
>
> 	Conform to additions of `parent_sp' and parent stack variables.
>
> 	XXX not sure about the changes in stack_layout.m
>
> library/par_builtin.m:
> 	Conform to changes in the runtime system.

...

> Index: compiler/code_info.m
> ===================================================================
> RCS file: /home/mercury/mercury1/repository/mercury/compiler/code_info.m,v
> retrieving revision 1.330
> diff -u -r1.330 code_info.m
> --- compiler/code_info.m	6 Sep 2006 04:02:55 -0000	1.330
> +++ compiler/code_info.m	11 Sep 2006 05:04:47 -0000
> @@ -167,6 +167,14 @@
>     %
> :- pred set_instmap(instmap::in, code_info::in, code_info::out) is det.
>
> +    % The depth of nested parallel conjunctions.
> +    %
> +:- pred get_par_conj_depth(code_info::in, int::out) is det.

That should say something like: return current nesting depth for
parallel conjunctions.

> @@ -252,6 +260,11 @@
> :- pred set_temp_content_map(map(lval, slot_contents)::in,
>     code_info::in, code_info::out) is det.
>
> +:- pred get_persistent_temps(code_info::in, set(lval)::out) is det.
> +
> +:- pred set_persistent_temps(set(lval)::in,
> +    code_info::in, code_info::out) is det.
> +
> :- pred set_closure_layouts(list(layout_data)::in,
>     code_info::in, code_info::out) is det.
>
> @@ -363,8 +376,14 @@
>                                     % fields below. Any keys in that map which
>                                     % are not in this set are free for reuse.
>
> -                fail_info           :: fail_info
> +                fail_info           :: fail_info,
>                                     % Information about how to manage failures.
> +
> +                par_conj_depth      :: int
> +                                    % How deep in a nested parallel conjunction
> +                                    % we are. This is zero at the beginning of
> +                                    % a procedure and increments as we enter
> +                                    % parallel conjunctions.

s/increments/incremented/

>             ).
>
> :- type code_info_persistent
> @@ -399,6 +418,11 @@
>                                     % slot contains after the end of the
>                                     % branched control structure.
>
> +                persistent_temps    :: set(lval),
> +                                    % Stack slot locations that should not be
> +                                    % released event when the code generator

s/event/even/

> +                                    % resets its location-dependent state.
> +
>                 closure_layout_seq :: counter,
>
>                 closure_layouts     :: list(layout_data),

...

> @@ -1034,9 +1065,14 @@
> remember_position(CI, position_info(CI)).
>
> reset_to_position(position_info(PosCI), CurCI, NextCI) :-
> -    PosCI  = code_info(_, LocDep, _),
> -    CurCI  = code_info(Static, _, Persistent),
> -    NextCI = code_info(Static, LocDep, Persistent).
> +    PosCI   = code_info(_, LocDep, _),
> +    CurCI   = code_info(Static, _, Persistent),
> +    NextCI0 = code_info(Static, LocDep, Persistent),
> +
> +    get_persistent_temps(NextCI0, PersistentTemps),
> +    get_temps_in_use(NextCI0, TempsInUse0),
> +    set.union(PersistentTemps, TempsInUse0, TempsInUse),
> +    set_temps_in_use(TempsInUse, NextCI0, NextCI).
>
> reset_resume_known(BranchStart, !CI) :-
>     BranchStart = position_info(BranchStartCI),
> @@ -3352,6 +3388,8 @@
>     ;
>         AbsLocn = abs_stackvar(_)
>     ;
> +        AbsLocn = abs_parent_stackvar(_)
> +    ;
>         AbsLocn = abs_framevar(_)
>     ).
>

...

> Index: compiler/llds.m
> ===================================================================
> RCS file: /home/mercury/mercury1/repository/mercury/compiler/llds.m,v
> retrieving revision 1.339
> diff -u -r1.339 llds.m
> --- compiler/llds.m	4 Sep 2006 01:47:31 -0000	1.339
> +++ compiler/llds.m	11 Sep 2006 05:04:48 -0000
> @@ -481,24 +481,17 @@
>             % documentation in par_conj_gen.m and runtime/mercury_context.{c,h}
>             % for further information about synchronisation terms.)
>
> -    ;       fork(label, label, int)
> -            % Create a new context. fork(Child, Parent, NumSlots) creates
> -            % a new thread which will start executing at Child. After spawning
> -            % execution in the child, control branches to Parent. NumSlots is
> -            % the number of stack slots that need to be copied to the child's
> -            % stack (see comments in runtime/mercury_context.{h,c}).
> -
> -    ;       join_and_terminate(lval)
> -            % Signal that this thread of execution has finished in the current
> -            % parallel conjunction, then terminate it. The synchronisation term
> -            % is specified by the given lval. (See the documentation in
> -            % par_conj_gen.m and runtime/mercury_context.{c,h} for further
> -            % information about synchronisation terms.)
> +    ;       fork(label)
> +            % Create a new spark. fork(Child) creates spark, to begin execution
> +            % at Child. Control continues at the next instruction.
>
>     ;       join_and_continue(lval, label).
>             % Signal that this thread of execution has finished in the current
> -            % parallel conjunction, then branch to the given label. The
> -            % synchronisation term is specified by the given lval.
> +            % parallel conjunct.  How to proceed at the end of a parallel
> +            % conjunct is quite involved; see runtime/mercury_context.{c,h}.

I suggest:

 	For details of how we proceed at the end of a parallel conjunct
 	see runtime/mercury_context.{c,h}.

> +            % The synchronisation term is specified by the given lval.
> +            % The label gives the address of the code following the parallel
> +            % conjunction.
>
> :- type nondet_frame_info
>     --->    temp_frame(
> @@ -777,7 +770,15 @@
>             % Virtual machine register holding the heap pointer.
>
>     ;       sp
> -            % Virtual machine register point to the top of det stack.
> +            % Virtual machine register pointing to the top of det stack.
> +
> +    ;       parent_sp
> +            % Virtual machine register pointing to the top of the det stack.
> +            % This is only set at the beginning of a parallel conjunction (and
> +            % restored afterwards). Parallel conjuncts which refer to stack
> +            % slots use this register instead of sp, as they could be running
> +            % in a different context, where sp would be pointing into a
> +            % different det stack.
>
>     ;       temp(reg_type, int)
>             % A local temporary register. These temporary registers are
> @@ -795,6 +796,11 @@
>             % current value of `sp'. These are used in both det and semidet
>             % code. Stackvar slot numbers start at 1.
>
> +    ;       parent_stackvar(int)
> +            % A det stack slot. The number is the offset relative to the
> +            % value of `parent_sp'. These are used only in the code
> +            % of parallel conjuncts. Stackvar slot numbers start at 1.
> +
>     ;       framevar(int)
>             % A nondet stack slot. The reference is relative to the current
>             % value of `curfr'. These are used in nondet code. Framevar slot


To be continued ...

Julien.
--------------------------------------------------------------------------
mercury-reviews mailing list
Post messages to:       mercury-reviews at csse.unimelb.edu.au
Administrative Queries: owner-mercury-reviews at csse.unimelb.edu.au
Subscriptions:          mercury-reviews-request at csse.unimelb.edu.au
--------------------------------------------------------------------------



More information about the reviews mailing list