[m-rev.] for review: improve parallel execution mechanism

Peter Wang novalazy at gmail.com
Tue Sep 4 17:39:08 AEST 2007


Branches: main

Make the parallel conjunction execution mechanism more efficient.

1. Don't allocate sync terms on the heap.  Sync terms are now allocated in
the stack frame of the procedure call which originates a parallel
conjunction.

2. Don't allocate individual sparks on the heap.  Sparks are now stored in
preallocated, growing arrays using an algorithm that doesn't use locks.

3. Don't have one mutex per sync term.  Just use one mutex to protect
concurrent accesses to all sync terms (it's is rarely needed anyway).  This
makes sync terms smaller and saves initialising a mutex for each parallel
conjunction encountered.

4. We don't bother to acquire the global sync term lock if we know a parallel
conjunction couldn't be executing in parallel.  In a highly parallel program,
the majority of parallel conjunctions will be executed sequentially so
protecting the sync terms from concurrent accesses is unnecessary.


par_fib(39) is ~8.4 times faster (user time) on my laptop (Linux 2.6, x86_64),
which is ~3.5 as slow as sequential execution.


configure.in:
	Update the configuration for a changed MR_SyncTerm structure.

compiler/llds.m:
	Make the fork instruction take a second argument, which is the base
	stack slot of the sync term.

	Rename it to fork_new_child to match the macro name in the runtime.

compiler/par_conj_gen.m:
	Change the generated code for parallel conjunctions to allocate sync
	terms on the stack and to pass the sync term to fork_new_child.

compiler/dupelim.m:
compiler/dupproc.m:
compiler/exprn_aux.m:
compiler/global_data.m:
compiler/jumpopt.m:
compiler/livemap.m:
compiler/llds_out.m:
compiler/llds_to_x86_64.m:
compiler/middle_rec.m:
compiler/opt_debug.m:
compiler/opt_util.m:
compiler/reassign.m:
compiler/use_local_vars.m:
	Conform to the change in the fork instruction.

compiler/liveness.m:
compiler/proc_gen.m:
	Disable use of the parallel conjunction operator in the compiler as
	older versions of the compiler will generate code incompatible with
	the new runtime.

runtime/mercury_context.c:
runtime/mercury_context.h:
	Remove the next pointer field from MR_Spark as it's no longer needed.

	Remove the mutex from MR_SyncTerm.  Add a field to record if a spark
	belonging to the sync term was scheduled globally, i.e. if the
	parallel conjunction might be executed in parallel.

	Define MR_SparkDeque and MR_SparkArray.

	Use MR_SparkDeques to hold per-context sparks and global sparks.

	Change the abstract machine instructions MR_init_sync_term,
	MR_fork_new_child, MR_join_and_continue as per the main change log.

	Use a preprocessor macro MR_LL_PARALLEL_CONJ as a shorthand for
	!MR_HIGHLEVEL_CODE && MR_THREAD_SAFE.

	Take the opportunity to clean things up a bit.

runtime/mercury_wsdeque.c:
runtime/mercury_wsdeque.h:
	New files containing an implementation of work-stealing deques.  We
	don't do work stealing yet but we use the underlying data structure.

runtime/mercury_atomic.h:
	New file to contain atomic operations.  Currently it just contains
	compare-and-swap for gcc/x86_64, gcc/x86 and gcc-4.1.

runtime/Mmakefile:
	Add the new files.

runtime/mercury_engine.h:
runtime/mercury_mm_own_stacks.c:
runtime/mercury_wrapper.c:
	Conform to runtime changes.

runtime/mercury_conf_param.h:
	Update an outdated comment.


Index: configure.in
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/configure.in,v
retrieving revision 1.499
diff -u -r1.499 configure.in
--- configure.in	13 Aug 2007 01:10:39 -0000	1.499
+++ configure.in	4 Sep 2007 07:30:35 -0000
@@ -1719,10 +1719,9 @@
 	#include <pthread.h>
 	int main() {
 		struct {
-			pthread_mutex_t lock;
 			void		*orig_context;
 			int		count;
-			void		*parent;
+			int 		is_shared;
 		} x;
 		FILE *fp;
 
Index: ./compiler/llds_to_x86_64.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/llds_to_x86_64.m,v
retrieving revision 1.8
diff -u -r1.8 llds_to_x86_64.m
--- ./compiler/llds_to_x86_64.m	20 Aug 2007 03:35:57 -0000	1.8
+++ ./compiler/llds_to_x86_64.m	4 Sep 2007 07:30:35 -0000
@@ -428,7 +428,8 @@
     Instr = [x86_64_comment("<<foreign_proc_code>>")].
 instr_to_x86_64(!RegMap, init_sync_term(_, _), Instr) :-
     Instr = [x86_64_comment("<<init_sync_term>>")].
-instr_to_x86_64(!RegMap, fork(_), [x86_64_comment("<<fork>>")]).
+instr_to_x86_64(!RegMap, fork_new_child(_, _), Instr) :-
+    Instr = [x86_64_comment("<<fork_new_child>>")].
 instr_to_x86_64(!RegMap, join_and_continue(_, _), Instr) :-
     Instr = [x86_64_comment("<<join_and_continue>>")].
 
Index: ./compiler/par_conj_gen.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/par_conj_gen.m,v
retrieving revision 1.35
diff -u -r1.35 par_conj_gen.m
--- ./compiler/par_conj_gen.m	7 Aug 2007 07:10:01 -0000	1.35
+++ ./compiler/par_conj_gen.m	4 Sep 2007 07:30:35 -0000
@@ -74,11 +74,10 @@
 % finished, causes the code following the parallel conjunction to execute in
 % the context that originated the parallel conjunction.  If the originating
 % context can't execute the next conjunct and the parallel conjunction isn't
-% finished, it must suspend and store its address in the sync term.  When a
-% non-originating context later finds that the parallel conjunction _is_
-% finished, it will then cause the originating context to resume execution
-% at the join point.  Please see the implementation of MR_join_and_continue()
-% for the details.
+% finished, it must suspend.  When a non-originating context later finds that
+% the parallel conjunction _is_ finished, it will then cause the originating
+% context to resume execution at the join point.  Please see the
+% implementation of MR_join_and_continue() for the details.
 %
 % The runtime support for parallel conjunction is documented in the runtime
 % directory in mercury_context.{c,h}.
@@ -180,33 +179,33 @@
     code_info.get_module_info(!.CI, ModuleInfo),
     find_outputs(Variables, Initial, Final, ModuleInfo, [], Outputs),
 
-    list.length(Goals, NumGoals),
-    code_info.acquire_reg(reg_r, RegLval, !CI),
-    code_info.acquire_temp_slot(slot_sync_term, persistent_temp_slot, SyncSlot,
-        !CI),
-    ( SyncSlot = stackvar(SlotNum) ->
-        ParentSyncSlot = parent_stackvar(SlotNum)
+    % Reserve a contiguous block on the stack to hold the synchronisation term.
+    Contents = list.duplicate(STSize, slot_sync_term),
+    code_info.acquire_several_temp_slots(Contents, persistent_temp_slot,
+        SyncTermSlots, StackId, _N, _M, !CI),
+    (
+        % The highest numbered slot has the lowest address.
+        list.last(SyncTermSlots, SyncTermBaseSlotPrime),
+        SyncTermBaseSlotPrime = stackvar(SlotNum),
+        StackId = det_stack
+    ->
+        SyncTermBaseSlot = SyncTermBaseSlotPrime,
+        ParentSyncTermBaseSlot = parent_stackvar(SlotNum)
     ;
         unexpected(this_file, "generate_par_conj")
     ),
 
+    NumGoals = list.length(Goals),
     MakeSyncTermCode = node([
-        % The may_not_use_atomic here is conservative.
-        llds_instr(incr_hp(RegLval, no, no, const(llconst_int(STSize)),
-            "sync term", may_not_use_atomic_alloc, no),
-            "allocate a sync term"),
-        llds_instr(init_sync_term(RegLval, NumGoals),
-            "initialize sync term"),
-        llds_instr(assign(SyncSlot, lval(RegLval)),
-            "store the sync term on the stack")
+        llds_instr(init_sync_term(SyncTermBaseSlot, NumGoals),
+            "initialize sync term")
     ]),
-    code_info.release_reg(RegLval, !CI),
 
     code_info.set_par_conj_depth(Depth+1, !CI),
     code_info.get_next_label(EndLabel, !CI),
     code_info.clear_all_registers(no, !CI),
-    generate_det_par_conj_2(Goals, ParentSyncSlot, EndLabel, Initial, no,
-        GoalCode, !CI),
+    generate_det_par_conj_2(Goals, ParentSyncTermBaseSlot, EndLabel, Initial,
+        no, GoalCode, !CI),
     code_info.set_par_conj_depth(Depth, !CI),
 
     EndLabelCode = node([
@@ -240,7 +239,8 @@
     % XXX release sync slots of nested parallel conjunctions
     %
     ( Depth = 0 ->
-        code_info.release_temp_slot(SyncSlot, persistent_temp_slot, !CI)
+        code_info.release_several_temp_slots(SyncTermSlots,
+            persistent_temp_slot, !CI)
     ;
         true
     ),
@@ -258,9 +258,9 @@
     lval::in, label::in, instmap::in, branch_end::in, code_tree::out,
     code_info::in, code_info::out) is det.
 
-generate_det_par_conj_2([], _ParentSyncTerm, _EndLabel,
+generate_det_par_conj_2([], _ParentSyncTermBaseSlot, _EndLabel,
         _Initial, _, empty, !CI).
-generate_det_par_conj_2([Goal | Goals], ParentSyncTerm, EndLabel,
+generate_det_par_conj_2([Goal | Goals], ParentSyncTermBaseSlot, EndLabel,
         Initial, MaybeEnd0, Code, !CI) :-
     code_info.remember_position(!.CI, StartPos),
     code_gen.generate_goal(model_det, Goal, ThisGoalCode0, !CI),
@@ -280,22 +280,26 @@
         code_info.get_next_label(NextConjunct, !CI),
         code_info.reset_to_position(StartPos, !CI),
         ForkCode = node([
-            llds_instr(fork(NextConjunct), "fork off a child")
+            llds_instr(fork_new_child(ParentSyncTermBaseSlot, NextConjunct),
+                "fork off a child")
         ]),
         JoinCode = node([
-            llds_instr(join_and_continue(ParentSyncTerm, EndLabel), "finish"),
-            llds_instr(label(NextConjunct), "start of the next conjunct")
+            llds_instr(join_and_continue(ParentSyncTermBaseSlot, EndLabel),
+                "finish"),
+            llds_instr(label(NextConjunct),
+                "start of the next conjunct")
         ])
     ;
         Goals = [],
         ForkCode = empty,
         JoinCode = node([
-            llds_instr(join_and_continue(ParentSyncTerm, EndLabel), "finish")
+            llds_instr(join_and_continue(ParentSyncTermBaseSlot, EndLabel),
+                "finish")
         ])
     ),
     ThisCode = tree_list([ForkCode, ThisGoalCode, SaveCode, JoinCode]),
-    generate_det_par_conj_2(Goals, ParentSyncTerm, EndLabel, Initial, MaybeEnd,
-        RestCode, !CI),
+    generate_det_par_conj_2(Goals, ParentSyncTermBaseSlot, EndLabel, Initial,
+        MaybeEnd, RestCode, !CI),
     Code = tree(ThisCode, RestCode).
 
 %-----------------------------------------------------------------------------%
Index: ./compiler/opt_debug.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/opt_debug.m,v
retrieving revision 1.197
diff -u -r1.197 opt_debug.m
--- ./compiler/opt_debug.m	20 Aug 2007 03:36:03 -0000	1.197
+++ ./compiler/opt_debug.m	4 Sep 2007 07:30:35 -0000
@@ -893,11 +893,12 @@
         Str = "init_sync_term(" ++ dump_lval(yes(ProcLabel), Lval) ++ ", "
             ++ int_to_string(N) ++ ")"
     ;
-        Instr = fork(Child),
-        Str = "fork(" ++ dump_label(yes(ProcLabel), Child) ++ ")"
+        Instr = fork_new_child(Lval, Child),
+        Str = "fork_new_child(" ++ dump_lval(yes(ProcLabel), Lval)
+            ++ dump_label(yes(ProcLabel), Child) ++ ", " ++ ")"
     ;
         Instr = join_and_continue(Lval, Label),
-        Str = "join(" ++ dump_lval(yes(ProcLabel), Lval) ++ ", "
+        Str = "join_and_continue(" ++ dump_lval(yes(ProcLabel), Lval) ++ ", "
             ++ dump_label(yes(ProcLabel), Label) ++ ")"
     ;
         Instr = foreign_proc_code(Decls, Comps, MCM, MFNL, MFL, MFOL, MNF,
Index: ./compiler/dupproc.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/dupproc.m,v
retrieving revision 1.21
diff -u -r1.21 dupproc.m
--- ./compiler/dupproc.m	20 Aug 2007 03:35:53 -0000	1.21
+++ ./compiler/dupproc.m	4 Sep 2007 07:30:35 -0000
@@ -222,9 +222,9 @@
         Instr = incr_sp(NumSlots, _, Kind),
         StdInstr = incr_sp(NumSlots, "", Kind)
     ;
-        Instr = fork(Child),
+        Instr = fork_new_child(Lval, Child),
         standardize_label(Child, StdChild, DupProcMap),
-        StdInstr = fork(StdChild)
+        StdInstr = fork_new_child(Lval, StdChild)
     ;
         Instr = join_and_continue(Lval, Label),
         standardize_label(Label, StdLabel, DupProcMap),
Index: ./compiler/liveness.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/liveness.m,v
retrieving revision 1.156
diff -u -r1.156 liveness.m
--- ./compiler/liveness.m	7 Aug 2007 07:09:57 -0000	1.156
+++ ./compiler/liveness.m	4 Sep 2007 07:30:35 -0000
@@ -241,7 +241,7 @@
         list.split_list(1000, PredIds, HeadPredIds, TailPredIds)
     then
         ( detect_liveness_preds_parallel_3(HeadPredIds, HLDS0, !HLDS)
-        & detect_liveness_preds_parallel_2(TailPredIds, HLDS0, !HLDS)
+        , detect_liveness_preds_parallel_2(TailPredIds, HLDS0, !HLDS)
         )
     else
         detect_liveness_preds_parallel_3(PredIds, HLDS0, !HLDS)
Index: ./compiler/proc_gen.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/proc_gen.m,v
retrieving revision 1.22
diff -u -r1.22 proc_gen.m
--- ./compiler/proc_gen.m	13 Aug 2007 03:01:43 -0000	1.22
+++ ./compiler/proc_gen.m	4 Sep 2007 07:30:35 -0000
@@ -171,7 +171,7 @@
         list.map_foldl(generate_pred_code_par(ModuleInfo0),
             PredIdsA, PredProceduresA, GlobalData0, GlobalDataA),
         list.condense(PredProceduresA, ProceduresA)
-    &
+    ,
         list.condense(ListsOfPredIdsB, PredIdsB),
         GlobalData1 = bump_type_num_counter(GlobalData0, type_num_skip),
         list.map_foldl(generate_pred_code_par(ModuleInfo0),
Index: ./compiler/llds_out.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/llds_out.m,v
retrieving revision 1.316
diff -u -r1.316 llds_out.m
--- ./compiler/llds_out.m	20 Aug 2007 03:35:57 -0000	1.316
+++ ./compiler/llds_out.m	4 Sep 2007 07:30:35 -0000
@@ -1784,7 +1784,7 @@
     ->
         set_tree234.insert(ContLabel, !ContLabelSet)
     ;
-        Uinstr = fork(Label1)
+        Uinstr = fork_new_child(_, Label1)
     ->
         set_tree234.insert(Label1, !ContLabelSet)
     ;
@@ -2009,8 +2009,9 @@
     list.foldl2(output_foreign_proc_component_decls, Comps, !DeclSet, !IO).
 output_instr_decls(_, init_sync_term(Lval, _), !DeclSet, !IO) :-
     output_lval_decls(Lval, !DeclSet, !IO).
-output_instr_decls(_, fork(Child), !DeclSet, !IO) :-
-    output_code_addr_decls(code_label(Child), !DeclSet, !IO).
+output_instr_decls(_, fork_new_child(Lval, Child), !DeclSet, !IO) :-
+    output_code_addr_decls(code_label(Child), !DeclSet, !IO),
+    output_lval_decls(Lval, !DeclSet, !IO).
 output_instr_decls(_, join_and_continue(Lval, Label), !DeclSet, !IO) :-
     output_lval_decls(Lval, !DeclSet, !IO),
     output_code_addr_decls(code_label(Label), !DeclSet, !IO).
@@ -2740,8 +2741,10 @@
     io.write_int(N, !IO),
     io.write_string(");\n", !IO).
 
-output_instruction(fork(Child), _, !IO) :-
+output_instruction(fork_new_child(Lval, Child), _, !IO) :-
     io.write_string("\tMR_fork_new_child(", !IO),
+    output_lval_as_word(Lval, !IO),
+    io.write_string(", ", !IO),
     output_label_as_code_addr(Child, !IO),
     io.write_string(");\n", !IO).
 
Index: ./compiler/dupelim.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/dupelim.m,v
retrieving revision 1.91
diff -u -r1.91 dupelim.m
--- ./compiler/dupelim.m	31 Jul 2007 01:56:34 -0000	1.91
+++ ./compiler/dupelim.m	4 Sep 2007 07:30:35 -0000
@@ -422,7 +422,7 @@
         ; Instr0 = incr_sp(_, _, _)
         ; Instr0 = decr_sp(_)
         ; Instr0 = decr_sp_and_return(_)
-        ; Instr0 = fork(_)
+        ; Instr0 = fork_new_child(_, _)
         ; Instr0 = foreign_proc_code(_, _, _, _, _, _, _, _, _)
         ),
         Instr = Instr0
@@ -791,7 +791,7 @@
         ; InstrA = decr_sp(_)
         ; InstrA = decr_sp_and_return(_)
         ; InstrA = foreign_proc_code(_, _, _, _, _, _, _, _, _)
-        ; InstrA = fork(_)
+        ; InstrA = fork_new_child(_, _)
         ; InstrA = init_sync_term(_, _)
         ; InstrA = join_and_continue(_, _)
         ),
Index: ./compiler/livemap.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/livemap.m,v
retrieving revision 1.88
diff -u -r1.88 livemap.m
--- ./compiler/livemap.m	31 Jul 2007 01:56:36 -0000	1.88
+++ ./compiler/livemap.m	4 Sep 2007 07:30:35 -0000
@@ -318,7 +318,7 @@
     ;
         Uinstr0 = init_sync_term(_, _)
     ;
-        Uinstr0 = fork(_)
+        Uinstr0 = fork_new_child(_, _)
     ;
         Uinstr0 = join_and_continue(_, _)
     ;
Index: ./compiler/global_data.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/global_data.m,v
retrieving revision 1.34
diff -u -r1.34 global_data.m
--- ./compiler/global_data.m	20 Aug 2007 03:35:54 -0000	1.34
+++ ./compiler/global_data.m	4 Sep 2007 07:30:35 -0000
@@ -1016,7 +1016,7 @@
         ; Instr0 = incr_sp(_, _, _)
         ; Instr0 = decr_sp(_)
         ; Instr0 = decr_sp_and_return(_)
-        ; Instr0 = fork(_)
+        ; Instr0 = fork_new_child(_, _)
         ),
         Instr = Instr0
     ).
Index: ./compiler/middle_rec.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/middle_rec.m,v
retrieving revision 1.130
diff -u -r1.130 middle_rec.m
--- ./compiler/middle_rec.m	31 Jul 2007 01:56:38 -0000	1.130
+++ ./compiler/middle_rec.m	4 Sep 2007 07:30:35 -0000
@@ -570,7 +570,8 @@
     find_used_registers_components(Components, !Used).
 find_used_registers_instr(init_sync_term(Lval, _), !Used) :-
     find_used_registers_lval(Lval, !Used).
-find_used_registers_instr(fork(_), !Used).
+find_used_registers_instr(fork_new_child(Lval, _), !Used) :-
+    find_used_registers_lval(Lval, !Used).
 find_used_registers_instr(join_and_continue(Lval, _), !Used) :-
     find_used_registers_lval(Lval, !Used).
 
Index: ./compiler/exprn_aux.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/exprn_aux.m,v
retrieving revision 1.84
diff -u -r1.84 exprn_aux.m
--- ./compiler/exprn_aux.m	20 Aug 2007 03:35:54 -0000	1.84
+++ ./compiler/exprn_aux.m	4 Sep 2007 07:30:35 -0000
@@ -353,7 +353,7 @@
         ; Uinstr0 = incr_sp(_, _, _)
         ; Uinstr0 = decr_sp(_)
         ; Uinstr0 = decr_sp_and_return(_)
-        ; Uinstr0 = fork(_)
+        ; Uinstr0 = fork_new_child(_, _)
         ),
         Uinstr = Uinstr0
     ;
Index: ./compiler/opt_util.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/opt_util.m,v
retrieving revision 1.166
diff -u -r1.166 opt_util.m
--- ./compiler/opt_util.m	20 Aug 2007 03:36:03 -0000	1.166
+++ ./compiler/opt_util.m	4 Sep 2007 07:30:35 -0000
@@ -871,7 +871,9 @@
         ; Uinstr = incr_sp(_, _, _)
         ; Uinstr = decr_sp(_)
         ; Uinstr = decr_sp_and_return(_)
-        ; Uinstr = fork(_)
+        ; Uinstr = init_sync_term(_, _)
+        ; Uinstr = fork_new_child(_, _)
+        ; Uinstr = join_and_continue(_, _)
         ),
         Refers = yes
     ;
@@ -938,12 +940,6 @@
         Uinstr = foreign_proc_code(_, Components, _, _, _, _, _, _, _),
         Refers = bool.or_list(list.map(foreign_proc_component_refers_stackvars,
             Components))
-    ;
-        Uinstr = init_sync_term(Lval, _),
-        Refers = lval_refers_stackvars(Lval)
-    ;
-        Uinstr = join_and_continue(Lval, _),
-        Refers = lval_refers_stackvars(Lval)
     ).
 
 :- func foreign_proc_component_refers_stackvars(foreign_proc_component) = bool.
@@ -1085,7 +1081,7 @@
 can_instr_branch_away(decr_sp(_)) = no.
 can_instr_branch_away(decr_sp_and_return(_)) = yes.
 can_instr_branch_away(init_sync_term(_, _)) = no.
-can_instr_branch_away(fork(_)) = no.
+can_instr_branch_away(fork_new_child(_, _)) = no.
 can_instr_branch_away(join_and_continue(_, _)) = yes.
 can_instr_branch_away(foreign_proc_code(_, Comps, _, _, _, _, _, _, _)) =
     can_components_branch_away(Comps).
@@ -1164,7 +1160,7 @@
 can_instr_fall_through(decr_sp(_)) = yes.
 can_instr_fall_through(decr_sp_and_return(_)) = no.
 can_instr_fall_through(init_sync_term(_, _)) = yes.
-can_instr_fall_through(fork(_)) = yes.
+can_instr_fall_through(fork_new_child(_, _)) = yes.
 can_instr_fall_through(join_and_continue(_, _)) = no.
 can_instr_fall_through(foreign_proc_code(_, _, _, _, _, _, _, _, _)) = yes.
 
@@ -1215,7 +1211,7 @@
 can_use_livevals(decr_sp(_), no).
 can_use_livevals(decr_sp_and_return(_), yes).
 can_use_livevals(init_sync_term(_, _), no).
-can_use_livevals(fork(_), no).
+can_use_livevals(fork_new_child(_, _), no).
 can_use_livevals(join_and_continue(_, _), no).
 can_use_livevals(foreign_proc_code(_, _, _, _, _, _, _, _, _), no).
 
@@ -1288,7 +1284,7 @@
     % so late that this predicate should never be invoked on such instructions.
     unexpected(this_file, "instr_labels_2: decr_sp_and_return").
 instr_labels_2(init_sync_term(_, _), [], []).
-instr_labels_2(fork(Child), [Child], []).
+instr_labels_2(fork_new_child(_, Child), [Child], []).
 instr_labels_2(join_and_continue(_, Label), [Label], []).
 instr_labels_2(foreign_proc_code(_, _, _, MaybeFixLabel, MaybeLayoutLabel,
         MaybeOnlyLayoutLabel, MaybeSubLabel, _, _), Labels, []) :-
@@ -1351,7 +1347,7 @@
     % See the comment in instr_labels_2.
     unexpected(this_file, "possible_targets: decr_sp_and_return").
 possible_targets(init_sync_term(_, _), [], []).
-possible_targets(fork(_Child), [], []).
+possible_targets(fork_new_child(_, _), [], []).
 possible_targets(join_and_continue(_, L), [L], []).
 possible_targets(foreign_proc_code(_, _, _, MaybeFixedLabel, MaybeLayoutLabel,
         _, MaybeSubLabel, _, _), Labels, []) :-
@@ -1441,7 +1437,7 @@
 instr_rvals_and_lvals(decr_sp(_), [], []).
 instr_rvals_and_lvals(decr_sp_and_return(_), [], []).
 instr_rvals_and_lvals(init_sync_term(Lval, _), [], [Lval]).
-instr_rvals_and_lvals(fork(_), [], []).
+instr_rvals_and_lvals(fork_new_child(Lval, _), [], [Lval]).
 instr_rvals_and_lvals(join_and_continue(Lval, _), [], [Lval]).
 instr_rvals_and_lvals(foreign_proc_code(_, Cs, _, _, _, _, _, _, _),
         Rvals, Lvals) :-
@@ -1607,7 +1603,8 @@
 count_temps_instr(decr_sp_and_return(_), !R, !F).
 count_temps_instr(init_sync_term(Lval, _), !R, !F) :-
     count_temps_lval(Lval, !R, !F).
-count_temps_instr(fork(_), !R, !F).
+count_temps_instr(fork_new_child(Lval, _), !R, !F) :-
+    count_temps_lval(Lval, !R, !F).
 count_temps_instr(join_and_continue(Lval, _), !R, !F) :-
     count_temps_lval(Lval, !R, !F).
 count_temps_instr(foreign_proc_code(_, Comps, _, _, _, _, _, _, _), !R, !F) :-
@@ -1815,7 +1812,7 @@
         ; Uinstr = save_maxfr(_)
         ; Uinstr = restore_maxfr(_)
         ; Uinstr = init_sync_term(_, _)     % This is a safe approximation.
-        ; Uinstr = fork(_)                  % This is a safe approximation.
+        ; Uinstr = fork_new_child(_, _)     % This is a safe approximation.
         ; Uinstr = join_and_continue(_, _)  % This is a safe approximation.
         ),
         Touch = yes
@@ -2343,9 +2340,10 @@
         ),
         Uinstr = init_sync_term(Lval, N)
     ;
-        Uinstr0 = fork(Child0),
+        Uinstr0 = fork_new_child(Lval0, Child0),
+        replace_labels_lval(Lval0, Lval, ReplMap),
         replace_labels_label(Child0, Child, ReplMap),
-        Uinstr = fork(Child)
+        Uinstr = fork_new_child(Lval, Child)
     ;
         Uinstr0 = join_and_continue(Lval0, Label0),
         replace_labels_label(Label0, Label, ReplMap),
Index: ./compiler/use_local_vars.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/use_local_vars.m,v
retrieving revision 1.35
diff -u -r1.35 use_local_vars.m
--- ./compiler/use_local_vars.m	31 Jul 2007 01:56:41 -0000	1.35
+++ ./compiler/use_local_vars.m	4 Sep 2007 07:30:35 -0000
@@ -674,7 +674,7 @@
         ; Uinstr0 = decr_sp(_)
         ; Uinstr0 = decr_sp_and_return(_)
         ; Uinstr0 = init_sync_term(_, _)
-        ; Uinstr0 = fork(_)
+        ; Uinstr0 = fork_new_child(_, _)
         ; Uinstr0 = join_and_continue(_, _)
         ; Uinstr0 = arbitrary_c_code(_, _, _)
         )
Index: ./compiler/llds.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/llds.m,v
retrieving revision 1.352
diff -u -r1.352 llds.m
--- ./compiler/llds.m	20 Aug 2007 03:35:57 -0000	1.352
+++ ./compiler/llds.m	4 Sep 2007 07:30:35 -0000
@@ -546,16 +546,19 @@
             % duplicated by jump optimization.
 
     ;       init_sync_term(lval, int)
-            % Initialize a synchronization term. The first argument contains
-            % the lvalue into which we will store the synchronization term,
-            % and the second argument indicates how many branches we expect
-            % to join at the end of the parallel conjunction. (See the
-            % documentation in par_conj_gen.m and runtime/mercury_context.{c,h}
-            % for further information about synchronisation terms.)
-
-    ;       fork(label)
-            % Create a new spark. fork(Child) creates spark, to begin execution
-            % at Child. Control continues at the next instruction.
+            % Initialize a synchronization term, which is a continuous number
+            % of slots on the detstack.  The first argument contains the base
+            % address of the synchronization term.  The second argument
+            % indicates how many branches we expect to join at the end of the
+            % parallel conjunction. (See the documentation in par_conj_gen.m
+            % and runtime/mercury_context.{c,h} for further information about
+            % synchronisation terms.)
+
+    ;       fork_new_child(lval, label)
+            % Create a new spark. fork(SyncTerm, Child) creates spark, to begin
+            % execution at Child, where SyncTerm contains the base address of
+            % the synchronisation term. Control continues at the next
+            % instruction.
 
     ;       join_and_continue(lval, label).
             % Signal that this thread of execution has finished in the current
Index: ./compiler/jumpopt.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/jumpopt.m,v
retrieving revision 1.106
diff -u -r1.106 jumpopt.m
--- ./compiler/jumpopt.m	20 Aug 2007 03:35:56 -0000	1.106
+++ ./compiler/jumpopt.m	4 Sep 2007 07:30:35 -0000
@@ -785,12 +785,12 @@
         % for the last time.
         unexpected(this_file, "instr_list: block")
     ;
-        Uinstr0 = fork(Child0),
+        Uinstr0 = fork_new_child(SyncTerm, Child0),
         short_label(Instrmap, Child0, Child),
         ( Child = Child0 ->
             NewRemain = usual_case
         ;
-            Uinstr = fork(Child),
+            Uinstr = fork_new_child(SyncTerm, Child),
             Comment = Comment0 ++ " (redirect)",
             Instr = llds_instr(Uinstr, Comment),
             NewRemain = specified([Instr], Instrs0)
Index: ./compiler/reassign.m
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/compiler/reassign.m,v
retrieving revision 1.25
diff -u -r1.25 reassign.m
--- ./compiler/reassign.m	31 Jul 2007 01:56:40 -0000	1.25
+++ ./compiler/reassign.m	4 Sep 2007 07:30:35 -0000
@@ -316,7 +316,7 @@
         !:RevInstrs = [Instr0 | !.RevInstrs],
         clobber_dependents(Target, !KnownContentsMap, !DepLvalMap)
     ;
-        Uinstr0 = fork(_),
+        Uinstr0 = fork_new_child(_, _),
         !:RevInstrs = [Instr0 | !.RevInstrs],
         % Both the parent and the child thread jump to labels specified
         % by the fork instruction, so the value of !:KnownContentsMap doesn't
Index: ./runtime/mercury_context.h
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/runtime/mercury_context.h,v
retrieving revision 1.43
diff -u -r1.43 mercury_context.h
--- ./runtime/mercury_context.h	10 May 2007 05:24:16 -0000	1.43
+++ ./runtime/mercury_context.h	4 Sep 2007 07:30:35 -0000
@@ -25,10 +25,10 @@
 ** by `context_ptr'. The context contains no rN or fN registers - all
 ** registers are "context save" (by analogy to caller-save).
 **
-** When a new context is created information is passed to and from the
-** new context via the stack frame of the procedure that originated the
-** parallel conjunction. The code of a parallel conjunct has access
-** to the procedure's stack frame via the `parent_sp' register.
+** When a new context is created for a parallel conjunction, information is
+** passed to and from the new context via the stack frame of the procedure that
+** originated the parallel conjunction. The code of a parallel conjunct has
+** access to that original stack frame via the `parent_sp' register.
 **
 ** Contexts can migrate transparently between multiple POSIX threads.
 **
@@ -59,6 +59,14 @@
 #include "mercury_goto.h"       /* for MR_GOTO() */
 #include "mercury_conf.h"       /* for MR_CONSERVATIVE_GC */
 
+#if !defined(MR_HIGHLEVEL_CODE) && defined(MR_THREAD_SAFE)
+  /*
+  ** Whether we are in a grade which supports the low-level parallel
+  ** conjunction execution mechanism.
+  */
+  #define MR_LL_PARALLEL_CONJ 1
+#endif
+
 #ifdef  MR_THREAD_SAFE
   #define MR_IF_THREAD_SAFE(x)  x
   #define MR_IF_NOT_THREAD_SAFE(x) 
@@ -169,7 +177,7 @@
 ** parent_sp        The saved parent_sp for this context.
 **                  (Accessed via abstract machine register.)
 **
-** spark_stack      The sparks generated by this context, in a stack.
+** spark_deque      The sparks generated by this context.
 **                  (Accessed usually by explicitly specifying the context,
 **                  but also via MR_eng_this_context.)
 **
@@ -194,8 +202,23 @@
 **                  (Accessed via MR_eng_this_context.)
 */
 
+/*
+** A spark contains just enough information to begin execution of a parallel
+** conjunct.  Sparks exist on a context's local spark deque or in the global
+** spark queue.  In the former case, a spark will eventually be executed in the
+** same context (same detstack, etc.) as the code that generated the spark. In
+** the latter case the spark can be picked up and executed by any idle engine
+** in a different context.
+**
+** In the current implementation a spark is put on the global spark queue if,
+** at the time a fork instruction is reached, we think the spark has a chance
+** of being picked up for execution by an idle engine.  Otherwise the spark
+** goes on the context's spark stack. At the moment this is an irrevocable
+** decision. A future possibility is to allow idle engines to steal work
+** from the cold end of some context's spark deque.
+*/
+
 typedef struct MR_Context_Struct        MR_Context;
-typedef struct MR_Spark_Struct          MR_Spark;
 
 typedef enum {
     MR_CONTEXT_SIZE_REGULAR,
@@ -219,6 +242,24 @@
     MR_MemoryZones      *MR_zones_tail;
 };
 
+#ifdef MR_LL_PARALLEL_CONJ
+typedef struct MR_Spark_Struct          MR_Spark;
+typedef struct MR_SparkDeque_Struct     MR_SparkDeque;
+typedef struct MR_SparkArray_Struct     MR_SparkArray;
+
+struct MR_Spark_Struct {
+    MR_Code                 *MR_spark_resume;
+    MR_Word                 *MR_spark_parent_sp;
+    MR_ThreadLocalMuts      *MR_spark_thread_local_mutables;
+};
+
+struct MR_SparkDeque_Struct {
+    volatile MR_Integer     MR_sd_bottom;
+    volatile MR_Integer     MR_sd_top;
+    volatile MR_SparkArray  *MR_sd_active_array;
+};
+#endif  /* !MR_LL_PARALLEL_CONJ */ 
+
 struct MR_Context_Struct {
     const char          *MR_ctxt_id;
     MR_ContextSize      MR_ctxt_size;
@@ -257,8 +298,10 @@
     MR_Generator        *MR_ctxt_owner_generator;
   #endif /* MR_USE_MINIMAL_MODEL_OWN_STACKS */
 
+  #ifdef MR_LL_PARALLEL_CONJ
     MR_Word             *MR_ctxt_parent_sp;
-    MR_Spark            *MR_ctxt_spark_stack;
+    MR_SparkDeque       MR_ctxt_spark_deque;
+  #endif
 #endif /* !MR_HIGHLEVEL_CODE */
 
 #ifdef  MR_USE_TRAIL
@@ -283,33 +326,8 @@
 };
 
 /*
-** A spark contains just enough information to begin execution of a parallel
-** conjunct.  Sparks are allocated on the heap, and can be stored in a
-** context's spark stack, or in the global spark queue.  In the former case,
-** a spark will eventually be executed in the same context (same detstack,
-** etc.) as the code that generated the spark. In the latter case the spark
-** can be picked up and executed by any idle engine in a different context.
-**
-** In the current implementation a spark is put on the global spark queue if,
-** at the time a fork instruction is reached, we think the spark has a chance
-** of being picked up for execution by an idle engine.  Otherwise the spark
-** goes on the context's spark stack. At the moment this is an irrevocable
-** decision. A future possibility is to allow idle engines to steal work
-** from the cold end of some context's spark stack.
-*/
-
-#ifndef MR_HIGHLEVEL_CODE
-struct MR_Spark_Struct {
-    MR_Spark            *MR_spark_next;
-    MR_Code             *MR_spark_resume;
-    MR_Word             *MR_spark_parent_sp;
-    MR_ThreadLocalMuts  *MR_spark_thread_local_mutables;
-};
-#endif
-
-/*
 ** The runqueue is a linked list of contexts that are runnable.
-** The spark_queue is a linked list of sparks that are runnable.
+** The spark_queue is an array of sparks that are runnable.
 ** We keep them separate to prioritise contexts (which are mainly
 ** computations which have already started) over sparks (which are
 ** computations which have not begun).
@@ -317,14 +335,14 @@
 
 extern      MR_Context  *MR_runqueue_head;
 extern      MR_Context  *MR_runqueue_tail;
-#ifndef MR_HIGHLEVEL_CODE
-  extern    MR_Spark    *MR_spark_queue_head;
-  extern    MR_Spark    *MR_spark_queue_tail;
-#endif
 #ifdef  MR_THREAD_SAFE
   extern    MercuryLock MR_runqueue_lock;
   extern    MercuryCond MR_runqueue_cond;
 #endif
+#ifdef  MR_LL_PARALLEL_CONJ
+  extern    MR_SparkDeque   MR_spark_queue;
+  extern    MercuryLock     MR_sync_term_lock;
+#endif
 
 /*
 ** As well as the runqueue, we maintain a linked list of contexts
@@ -361,32 +379,36 @@
   extern    MercuryLock     MR_pending_contexts_lock;
 #endif
 
-/*
-** The number of engines waiting for work.
-** We don't protect it with a separate lock, but updates to it are made while
-** holding the MR_runqueue_lock.  Reads are made without the lock.
-** XXX We may need to use atomic instructions or memory fences on some
-** architectures.
-*/
-extern  int         MR_num_idle_engines;
-
-/*
-** The number of contexts that are not in the free list (i.e. are executing or
-** suspended) plus the number of sparks in the spark queue.  We count those
-** sparks as they can quickly accumulate on the spark queue before any of them
-** are taken up for execution.  Once they do get taken up, many contexts would
-** need to be allocated to execute them.  Sparks not on the spark queue are
-** currently guaranteed to be executed on their originating context so won't
-** cause allocation of more contexts.
-**
-** What we are mainly interested in here is preventing too many contexts from
-** being allocated, as each context is quite large and we can quickly run out
-** of memory.  Another problem is due to the context free list and conservative
-** garbage collection: every context ever allocated will be scanned.  (Getting
-** the garbage collector not to scan contexts on the free list should be
-** possible though.)
-*/
-extern  int         MR_num_outstanding_contexts_and_sparks;
+#ifdef  MR_LL_PARALLEL_CONJ
+  /*
+  ** The number of engines waiting for work.
+  ** We don't protect it with a separate lock, but updates to it are made while
+  ** holding the MR_runqueue_lock.  Reads are made without the lock.
+  ** XXX We may need to use atomic instructions or memory fences on some
+  ** architectures.
+  */
+  extern    int     MR_num_idle_engines;
+
+  /*
+  ** The number of contexts that are not in the free list (i.e. are executing
+  ** or suspended) plus the number of sparks in the spark queue.  We count
+  ** those sparks as they can quickly accumulate on the spark queue before any
+  ** of them are taken up for execution.  Once they do get taken up, many
+  ** contexts would need to be allocated to execute them.  Sparks not on the
+  ** spark queue are currently guaranteed to be executed on their originating
+  ** context so won't cause allocation of more contexts.
+  **
+  ** What we are mainly interested in here is preventing too many contexts from
+  ** being allocated, as each context is quite large and we can quickly run out
+  ** of memory.  Another problem is due to the context free list and
+  ** conservative garbage collection: every context ever allocated will be
+  ** scanned.  (Getting the garbage collector not to scan contexts on the free
+  ** list should be possible though.)
+  */
+  extern    int     MR_num_outstanding_contexts_and_sparks;
+#endif  /* !MR_LL_PARALLEL_CONJ */
+
+/*---------------------------------------------------------------------------*/
 
 /*
 ** Initializes a context structure, and gives it the given id. If gen is
@@ -415,7 +437,7 @@
 extern  void        MR_init_thread_stuff(void);
 
 /*
-** MR_finialize_runqueue() finalizes the lock structures for the runqueue.
+** MR_finalize_runqueue() finalizes the lock structures for the runqueue.
 */
 extern  void        MR_finalize_runqueue(void);
 
@@ -429,15 +451,14 @@
 /*
 ** Append the given context onto the end of the run queue.
 */
-
 extern  void        MR_schedule_context(MR_Context *ctxt);
 
-#ifndef MR_HIGHLEVEL_CODE
-/*
-** Append the given spark onto the end of the spark queue.
-*/
-extern  void        MR_schedule_spark_globally(MR_Spark *spark);
-#endif /* !MR_HIGHLEVEL_CODE */
+#ifdef MR_LL_PARALLEL_CONJ
+  /*
+  ** Append the given spark onto the end of the spark queue.
+  */
+  extern    void    MR_schedule_spark_globally(const MR_Spark *spark);
+#endif /* !MR_LL_PARALLEL_CONJ */
 
 #ifndef MR_HIGHLEVEL_CODE
   MR_declare_entry(MR_do_runnext);
@@ -447,58 +468,6 @@
     } while (0)
 #endif
 
-#ifdef  MR_THREAD_SAFE
-  #define MR_IF_MR_THREAD_SAFE(x)   x
-#else
-  #define MR_IF_MR_THREAD_SAFE(x)
-#endif
-
-#ifndef MR_HIGHLEVEL_CODE
-  /*
-  ** fork_new_child(MR_Code *child):
-  **
-  ** Create a new spark to execute the code at `child'.  The new spark is put
-  ** on the global spark queue or the context-local spark stack.  The current
-  ** context resumes at `parent'.  MR_parent_sp must already be set
-  ** appropriately before this instruction is executed.
-  */
-  #define MR_fork_new_child(child)                              \
-    do {                                                        \
-        MR_Spark *fnc_spark;                                    \
-                                                                \
-        fnc_spark = MR_GC_NEW(MR_Spark);                        \
-        fnc_spark->MR_spark_resume = (child);                   \
-        fnc_spark->MR_spark_parent_sp = MR_parent_sp;           \
-        fnc_spark->MR_spark_thread_local_mutables = MR_THREAD_LOCAL_MUTABLES; \
-        if (MR_fork_globally_criteria) {                        \
-            MR_schedule_spark_globally(fnc_spark);              \
-        } else {                                                \
-            MR_schedule_spark_locally(fnc_spark);               \
-        }                                                       \
-    } while (0)
-
-  #define MR_fork_globally_criteria                             \
-    (MR_num_idle_engines != 0 &&                                \
-    MR_num_outstanding_contexts_and_sparks < MR_max_outstanding_contexts)
-
-  #define MR_schedule_spark_locally(spark)                              \
-    do {                                                                \
-        MR_Context  *ssl_ctxt;                                          \
-                                                                        \
-        /*                                                              \
-        ** Only the engine running the context is allowed to access     \
-        ** the context's spark stack, so no locking is required here.   \
-        */                                                              \
-        ssl_ctxt = MR_ENGINE(MR_eng_this_context);                      \
-        spark->MR_spark_next = ssl_ctxt->MR_ctxt_spark_stack;           \
-        ssl_ctxt->MR_ctxt_spark_stack = spark;                          \
-    } while (0)
-
-  #define MR_choose_parallel_over_sequential_cond(target_cpus)          \
-      (MR_num_outstanding_contexts_and_sparks < target_cpus)
-
-#endif /* !MR_HIGHLEVEL_CODE */
-
 #ifndef MR_CONSERVATIVE_GC
 
   /*
@@ -714,118 +683,197 @@
         /* it wouldn't be appropriate to copy the resume field */             \
         to_cptr->MR_ctxt_thread_local_mutables =                              \
             from_cptr->MR_ctxt_thread_local_mutables;                         \
-        /* it wouldn't be appropriate to copy the spark_stack field */        \
+        /* it wouldn't be appropriate to copy the spark_queue field */        \
         /* it wouldn't be appropriate to copy the saved_owners field */       \
     } while (0)
 
-/*
-** If you change MR_Sync_Term_Struct you need to update configure.in.
-*/
-typedef struct MR_Sync_Term_Struct MR_SyncTerm;
-struct MR_Sync_Term_Struct {
-  #ifdef MR_THREAD_SAFE
-    MercuryLock     MR_st_lock;
-  #endif
+/*---------------------------------------------------------------------------*/
+
+#ifdef MR_LL_PARALLEL_CONJ
+
+  /*
+  ** If you change MR_SyncTerm_Struct you need to update configure.in.
+  **
+  ** MR_st_count is `int' so that on a 64-bit machine the total size of the
+  ** sync term is two words, not three words (assuming `int' is 32 bits).
+  **
+  ** XXX we should remove that assumption but it's a little tricky because
+  ** configure needs to understand the types as well
+  */
+
+  typedef struct MR_SyncTerm_Struct MR_SyncTerm;
+
+  struct MR_SyncTerm_Struct {
     MR_Context      *MR_st_orig_context;
-    int             MR_st_count;
-    MR_Context      *MR_st_parent;
-};
+    volatile int    MR_st_count;
+    volatile int    MR_st_is_shared;
+  };
+
+  #define MR_init_sync_term(sync_term, nbranches)                             \
+    do {                                                                      \
+        MR_SyncTerm *init_st = (MR_SyncTerm *) &(sync_term);                  \
+                                                                              \
+        init_st->MR_st_orig_context = MR_ENGINE(MR_eng_this_context);         \
+        init_st->MR_st_count = (nbranches);                                   \
+        init_st->MR_st_is_shared = MR_FALSE;                                  \
+    } while (0)
+
+  /*
+  ** fork_new_child(MR_SyncTerm st, MR_Code *child):
+  **
+  ** Create a new spark to execute the code at `child'.  The new spark is put
+  ** on the global spark queue or the context-local spark deque.  The current
+  ** context resumes at `parent'.  MR_parent_sp must already be set
+  ** appropriately before this instruction is executed.
+  **
+  ** If the spark ends up on the global spark queue then we set
+  ** `MR_st_is_shared' to true as branches of this parallel conjunction could
+  ** be executed in parallel.
+  */
+  #define MR_fork_new_child(sync_term, child)                                 \
+    do {                                                                      \
+        MR_Spark fnc_spark;                                                   \
+                                                                              \
+        fnc_spark.MR_spark_resume = (child);                                  \
+        fnc_spark.MR_spark_parent_sp = MR_parent_sp;                          \
+        fnc_spark.MR_spark_thread_local_mutables = MR_THREAD_LOCAL_MUTABLES;  \
+        if (MR_fork_globally_criteria) {                                      \
+            MR_SyncTerm *fnc_st = (MR_SyncTerm *) &(sync_term);               \
+            fnc_st->MR_st_is_shared = MR_TRUE;                                \
+            MR_schedule_spark_globally(&fnc_spark);                           \
+        } else {                                                              \
+            MR_schedule_spark_locally(&fnc_spark);                            \
+        }                                                                     \
+    } while (0)
+
+  #define MR_fork_globally_criteria                                           \
+    (MR_num_idle_engines != 0 &&                                              \
+    MR_num_outstanding_contexts_and_sparks < MR_max_outstanding_contexts)
 
-#define MR_is_orig_context(ctxt, st)    \
-    ((ctxt) == (st)->MR_st_orig_context)
+  #define MR_choose_parallel_over_sequential_cond(target_cpus)                \
+      (MR_num_outstanding_contexts_and_sparks < target_cpus)
 
-#define MR_init_sync_term(sync_term, nbranches)                     \
-    do {                                                            \
-        MR_SyncTerm *st;                                            \
-                                                                    \
-        st = (MR_SyncTerm *) sync_term;                             \
-        MR_assert(st != NULL);                                      \
-        MR_IF_THREAD_SAFE(                                          \
-            pthread_mutex_init(&(st->MR_st_lock), MR_MUTEX_ATTR);   \
-        )                                                           \
-        st->MR_st_orig_context = MR_ENGINE(MR_eng_this_context);    \
-        st->MR_st_count = (nbranches);                              \
-        st->MR_st_parent = NULL;                                    \
+  #define MR_schedule_spark_locally(spark)                                    \
+    do {                                                                      \
+        MR_Context  *ssl_ctxt;                                                \
+                                                                              \
+        /*                                                                    \
+        ** Only the engine running the context is allowed to access           \
+        ** the context's spark stack, so no locking is required here.         \
+        */                                                                    \
+        ssl_ctxt = MR_ENGINE(MR_eng_this_context);                            \
+        MR_wsdeque_push_bottom(&ssl_ctxt->MR_ctxt_spark_deque, (spark));      \
     } while (0)
 
-#define MR_join_and_continue(sync_term, join_label)                           \
+  #define MR_join_and_continue(sync_term, join_label)                         \
     do {                                                                      \
-        MR_SyncTerm *st;                                                      \
+        MR_SyncTerm *jnc_st = (MR_SyncTerm *) &sync_term;                     \
                                                                               \
-        st = (MR_SyncTerm *) sync_term;                                       \
-        MR_assert(st != NULL);                                                \
-        MR_LOCK(&(st->MR_st_lock), "continue");                               \
-        MR_assert(st->MR_st_count > 0);                                       \
-        (st->MR_st_count)--;                                                  \
-        if (st->MR_st_count == 0) {                                           \
-            if (MR_is_orig_context(MR_ENGINE(MR_eng_this_context), st)) {     \
-                /*                                                            \
-                ** This context originated this parallel conjunction and      \
-                ** all the branches have finished so jump to the join label.  \
-                */                                                            \
-                MR_assert(st->MR_st_parent == NULL);                          \
-                MR_UNLOCK(&(st->MR_st_lock), "continue i");                   \
+        if (!jnc_st->MR_st_is_shared) {                                       \
+            /* This parallel conjunction has only executed sequentially. */   \
+            if (--jnc_st->MR_st_count == 0) {                                 \
                 MR_GOTO(join_label);                                          \
             } else {                                                          \
-                /*                                                            \
-                ** This context didn't originate this parallel conjunction.   \
-                ** We're the last branch to finish.  The originating          \
-                ** context should be suspended waiting for us to finish,      \
-                ** so wake it up.                                             \
-                */                                                            \
-                MR_assert(st->MR_st_parent != NULL);                          \
-                st->MR_st_parent->MR_ctxt_resume = join_label;                \
-                MR_schedule_context(st->MR_st_parent);                        \
-                MR_UNLOCK(&(st->MR_st_lock), "continue ii");                  \
-                MR_runnext();                                                 \
+                MR_join_and_continue_1();                                     \
             }                                                                 \
         } else {                                                              \
-            MR_join_and_continue_2(st);                                       \
+            MR_Context *jnc_orig_ctxt;                                        \
+                                                                              \
+            /* This parallel conjunction may be executing in parallel. */     \
+            jnc_orig_ctxt = jnc_st->MR_st_orig_context;                       \
+            MR_LOCK(&MR_sync_term_lock, "continue");                          \
+            if (--jnc_st->MR_st_count == 0) {                                 \
+                if (MR_sp == MR_parent_sp) {                                  \
+                    /*                                                        \
+                    ** This context originated this parallel conjunction and  \
+                    ** all the branches have finished so jump to the join     \
+                    ** label.                                                 \
+                    */                                                        \
+                    MR_UNLOCK(&MR_sync_term_lock, "continue i");              \
+                    MR_GOTO(join_label);                                      \
+                } else {                                                      \
+                    /*                                                        \
+                    ** This context didn't originate this parallel            \
+                    ** conjunction and we're the last branch to finish.  The  \
+                    ** originating context should be suspended waiting for us \
+                    ** to finish, so wake it up.                              \
+                    */                                                        \
+                    jnc_st->MR_st_orig_context->MR_ctxt_resume = join_label;  \
+                    MR_schedule_context(jnc_st->MR_st_orig_context);          \
+                    MR_UNLOCK(&MR_sync_term_lock, "continue ii");             \
+                    MR_runnext();                                             \
+                }                                                             \
+            } else {                                                          \
+                MR_join_and_continue_2();                                     \
+            }                                                                 \
+        }                                                                     \
+    } while (0)
+
+  #define MR_join_and_continue_1()                                            \
+    do {                                                                      \
+        MR_Context  *jnc_ctxt;                                                \
+        MR_bool     jnc_popped;                                               \
+        MR_Spark    jnc_spark;                                                \
+                                                                              \
+        jnc_ctxt = MR_ENGINE(MR_eng_this_context);                            \
+        jnc_popped = MR_wsdeque_pop_bottom(&jnc_ctxt->MR_ctxt_spark_deque,    \
+            &jnc_spark);                                                      \
+        if (jnc_popped) {                                                     \
+            MR_GOTO(jnc_spark.MR_spark_resume);                               \
+        } else {                                                              \
+            MR_runnext();                                                     \
         }                                                                     \
     } while (0)
 
-#define MR_join_and_continue_2(st)                                            \
+  #define MR_join_and_continue_2()                                            \
     do {                                                                      \
-        MR_Context  *ctxt;                                                    \
-        MR_Spark    *spark;                                                   \
+        MR_Context  *jnc_ctxt;                                                \
+        MR_bool     jnc_popped;                                               \
+        MR_Spark    jnc_spark;                                                \
                                                                               \
-        ctxt = MR_ENGINE(MR_eng_this_context);                                \
-        spark = ctxt->MR_ctxt_spark_stack;                                    \
-        if (spark && (spark->MR_spark_parent_sp == MR_parent_sp)) {           \
+        jnc_ctxt = MR_ENGINE(MR_eng_this_context);                            \
+        jnc_popped = MR_wsdeque_pop_bottom(&jnc_ctxt->MR_ctxt_spark_deque,    \
+            &jnc_spark);                                                      \
+        if (jnc_popped && (jnc_spark.MR_spark_parent_sp == MR_parent_sp)) {   \
             /*                                                                \
             ** The spark at the top of the stack is due to the same parallel  \
             ** conjunction that we've just been executing. We can immediately \
             ** execute the next branch of the same parallel conjunction in    \
             ** the current context.                                           \
             */                                                                \
-            MR_UNLOCK(&(st->MR_st_lock), "continue_2 i");                     \
-            ctxt->MR_ctxt_spark_stack = spark->MR_spark_next;                 \
-            MR_GOTO(spark->MR_spark_resume);                                  \
+            MR_UNLOCK(&MR_sync_term_lock, "continue_2 i");                    \
+            MR_GOTO(jnc_spark.MR_spark_resume);                               \
         } else {                                                              \
             /*                                                                \
             ** The spark stack is empty or the next spark is from a different \
             ** parallel conjunction to the one we've been executing.  Either  \
             ** way, there's nothing more we can do with this context right    \
-            ** now.                                                           \
-            **                                                                \
+            ** now.  Put back the spark we won't be using.                    \
+            */                                                                \
+            if (jnc_popped) {                                                 \
+                MR_wsdeque_putback_bottom(&jnc_ctxt->MR_ctxt_spark_deque,     \
+                    &jnc_spark);                                              \
+            }                                                                 \
+            /*                                                                \
             ** If this context originated the parallel conjunction we've been \
             ** executing, the rest of the parallel conjunction must have been \
             ** put on the global spark queue to be executed in other          \
             ** contexts.  This context will need to be resumed once the       \
-            ** parallel conjunction is completed, so suspend the context and  \
-            ** save the address in the sync term.                             \
-            **                                                                \
-            ** Finally look for other work.                                   \
+            ** parallel conjunction is completed, so suspend the context.     \
             */                                                                \
-            if (MR_is_orig_context(MR_ENGINE(MR_eng_this_context), st)) {     \
-                MR_save_context(MR_ENGINE(MR_eng_this_context));              \
-                MR_assert(st->MR_st_parent == NULL);                          \
-                st->MR_st_parent = ctxt;                                      \
+            if (MR_sp == MR_parent_sp) {                                      \
+                MR_save_context(jnc_ctxt);                                    \
                 MR_ENGINE(MR_eng_this_context) = NULL;                        \
             }                                                                 \
-            MR_UNLOCK(&(st->MR_st_lock), "continue_2 ii");                    \
+            /* Finally look for other work. */                                \
+            MR_UNLOCK(&MR_sync_term_lock, "continue_2 ii");                   \
             MR_runnext();                                                     \
         }                                                                     \
-    } while (0)                                                               \
+    } while (0)
+
+  /* This needs to come after the definition of MR_SparkDeque_Struct. */
+  #include "mercury_wsdeque.h"
+
+#endif /* not MR_LL_PARALLEL_CONJ */
 
 #endif /* not MERCURY_CONTEXT_H */
Index: ./runtime/Mmakefile
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/runtime/Mmakefile,v
retrieving revision 1.140
diff -u -r1.140 Mmakefile
--- ./runtime/Mmakefile	20 Jul 2007 01:22:05 -0000	1.140
+++ ./runtime/Mmakefile	4 Sep 2007 07:30:35 -0000
@@ -25,6 +25,7 @@
 			mercury_accurate_gc.h	\
 			mercury_agc_debug.h	\
 			mercury_array_macros.h	\
+			mercury_atomic.h 	\
 			mercury_bootstrap.h	\
 			mercury_builtin_types.h	\
 			mercury_builtin_types_proc_layouts.h	\
@@ -97,7 +98,8 @@
 			mercury_types.h		\
 			mercury_type_tables.h	\
 			mercury_univ.h		\
-			mercury_wrapper.h
+			mercury_wrapper.h	\
+			mercury_wsdeque.h
 
 # The headers in $(BODY_HDRS) contain code schemes included multiple times
 # in one or more source files. Their dependencies must be explicitly listed.
@@ -195,7 +197,8 @@
 			mercury_type_desc.c	\
 			mercury_type_info.c	\
 			mercury_type_tables.c	\
-			mercury_wrapper.c
+			mercury_wrapper.c	\
+			mercury_wsdeque.c
 
 #-----------------------------------------------------------------------------#
 
Index: ./runtime/mercury_wrapper.c
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/runtime/mercury_wrapper.c,v
retrieving revision 1.182
diff -u -r1.182 mercury_wrapper.c
--- ./runtime/mercury_wrapper.c	1 May 2007 01:11:42 -0000	1.182
+++ ./runtime/mercury_wrapper.c	4 Sep 2007 07:30:35 -0000
@@ -2486,8 +2486,8 @@
 
 MR_define_label(all_done);
     assert(MR_runqueue_head == NULL);
-#ifndef MR_HIGHLEVEL_CODE
-    assert(MR_spark_queue_head == NULL);
+#ifdef MR_LL_PARALLEL_CONJ
+    assert(MR_wsdeque_is_empty(&MR_spark_queue));
 #endif
 
 #ifdef  MR_MPROF_PROFILE_TIME
Index: ./runtime/mercury_context.c
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/runtime/mercury_context.c,v
retrieving revision 1.56
diff -u -r1.56 mercury_context.c
--- ./runtime/mercury_context.c	1 May 2007 01:11:42 -0000	1.56
+++ ./runtime/mercury_context.c	4 Sep 2007 07:30:35 -0000
@@ -40,17 +40,24 @@
 /*
 ** The run queue and spark queue are protected and signalled with the
 ** same lock and condition variable.
+**
+** The single sync term lock is used to prevent races in MR_join_and_continue.
+** The holder of the sync term lock may acquire the runqueue lock but not vice
+** versa.  (We could also have one sync term lock per context, and make
+** MR_join_and_continue acquire the sync term lock of the context that
+** originated the parallel conjunction, but contention of the single lock
+** doesn't seem to be an issue.)
 */
 MR_Context              *MR_runqueue_head;
 MR_Context              *MR_runqueue_tail;
-#ifndef MR_HIGHLEVEL_CODE
-  MR_Spark              *MR_spark_queue_head;
-  MR_Spark              *MR_spark_queue_tail;
-#endif
 #ifdef  MR_THREAD_SAFE
   MercuryLock           MR_runqueue_lock;
   MercuryCond           MR_runqueue_cond;
 #endif
+#ifdef  MR_LL_PARALLEL_CONJ
+  MR_SparkDeque         MR_spark_queue;
+  MercuryLock           MR_sync_term_lock;
+#endif
 
 MR_PendingContext       *MR_pending_contexts;
 #ifdef  MR_THREAD_SAFE
@@ -69,8 +76,10 @@
   static MercuryLock    free_context_list_lock;
 #endif
 
+#ifdef  MR_LL_PARALLEL_CONJ
 int MR_num_idle_engines = 0;
 int MR_num_outstanding_contexts_and_sparks = 0;
+#endif
 
 /*---------------------------------------------------------------------------*/
 
@@ -84,6 +93,10 @@
     pthread_mutex_init(&free_context_list_lock, MR_MUTEX_ATTR);
     pthread_mutex_init(&MR_global_lock, MR_MUTEX_ATTR);
     pthread_mutex_init(&MR_pending_contexts_lock, MR_MUTEX_ATTR);
+  #ifdef MR_LL_PARALLEL_CONJ
+    MR_init_wsdeque(&MR_spark_queue, MR_INITIAL_GLOBAL_SPARK_QUEUE_SIZE);
+    pthread_mutex_init(&MR_sync_term_lock, MR_MUTEX_ATTR);
+  #endif
   #ifndef MR_THREAD_LOCAL_STORAGE
     MR_KEY_CREATE(&MR_engine_base_key, NULL);
   #endif
@@ -106,6 +119,9 @@
     pthread_cond_destroy(&MR_runqueue_cond);
     pthread_mutex_destroy(&free_context_list_lock);
 #endif
+#ifdef  MR_LL_PARALLEL_CONJ
+    pthread_mutex_destroy(&MR_sync_term_lock);
+#endif
 }
 
 static void 
@@ -236,8 +252,11 @@
     c->MR_ctxt_owner_generator = gen;
   #endif /* MR_USE_MINIMAL_MODEL_OWN_STACKS */
 
+  #ifdef MR_LL_PARALLEL_CONJ
     c->MR_ctxt_parent_sp = NULL;
-    c->MR_ctxt_spark_stack = NULL;
+    MR_init_wsdeque(&c->MR_ctxt_spark_deque,
+        MR_INITIAL_LOCAL_SPARK_DEQUE_SIZE);
+  #endif /* MR_LL_PARALLEL_CONJ */
 
 #endif /* !MR_HIGHLEVEL_CODE */
 
@@ -283,7 +302,9 @@
 
     MR_LOCK(&free_context_list_lock, "create_context");
 
+#ifdef MR_LL_PARALLEL_CONJ
     MR_num_outstanding_contexts_and_sparks++;
+#endif
 
     /*
     ** Regular contexts have stacks at least as big as small contexts,
@@ -309,6 +330,9 @@
         c->MR_ctxt_detstack_zone = NULL;
         c->MR_ctxt_nondetstack_zone = NULL;
 #endif
+#ifdef MR_LL_PARALLEL_CONJ
+        c->MR_ctxt_spark_deque.MR_sd_active_array = NULL;
+#endif
 #ifdef MR_USE_TRAIL
         c->MR_ctxt_trail_zone = NULL;
 #endif
@@ -326,9 +350,6 @@
 #ifdef MR_THREAD_SAFE
     MR_assert(c->MR_ctxt_saved_owners == NULL);
 #endif
-#ifndef MR_HIGHLEVEL_CODE
-    MR_assert(c->MR_ctxt_spark_stack == NULL);
-#endif
 
     /* XXX not sure if this is an overall win yet */
 #if 0 && defined(MR_CONSERVATIVE_GC) && !defined(MR_HIGHLEVEL_CODE)
@@ -340,7 +361,9 @@
 #endif /* defined(MR_CONSERVATIVE_GC) && !defined(MR_HIGHLEVEL_CODE) */
 
     MR_LOCK(&free_context_list_lock, "destroy_context");
+#ifdef MR_LL_PARALLEL_CONJ
     MR_num_outstanding_contexts_and_sparks--;
+#endif
 
     switch (c->MR_ctxt_size) {
         case MR_CONTEXT_SIZE_REGULAR:
@@ -476,27 +499,21 @@
     MR_UNLOCK(&MR_runqueue_lock, "schedule_context");
 }
 
-#ifndef MR_HIGHLEVEL_CODE
-
+#ifdef MR_LL_PARALLEL_CONJ
 void
-MR_schedule_spark_globally(MR_Spark *spark)
+MR_schedule_spark_globally(const MR_Spark *proto_spark)
 {
     MR_LOCK(&MR_runqueue_lock, "schedule_spark_globally");
-    if (MR_spark_queue_tail) {
-        MR_spark_queue_tail->MR_spark_next = spark;
-        MR_spark_queue_tail = spark;
-    } else {
-        MR_spark_queue_head = spark;
-        MR_spark_queue_tail = spark;
-    }
+    MR_wsdeque_push_bottom(&MR_spark_queue, proto_spark);
     MR_num_outstanding_contexts_and_sparks++;
-  #ifdef MR_THREAD_SAFE
     MR_SIGNAL(&MR_runqueue_cond);
-  #endif
     MR_UNLOCK(&MR_runqueue_lock, "schedule_spark_globally");
 }
+#endif /* !MR_LL_PARALLEL_CONJ */
 
 
+#ifndef MR_HIGHLEVEL_CODE
+
 MR_define_extern_entry(MR_do_runnext);
 
 MR_BEGIN_MODULE(scheduler_module)
@@ -508,7 +525,7 @@
 {
     MR_Context      *tmp;
     MR_Context      *prev;
-    MR_Spark        *spark;
+    MR_Spark        spark;
     unsigned        depth;
     MercuryThread   thd;
 
@@ -516,8 +533,12 @@
     ** If this engine is holding onto a context, the context should not be
     ** in the middle of running some code.
     */
-    assert(MR_ENGINE(MR_eng_this_context) == NULL ||
-        MR_ENGINE(MR_eng_this_context)->MR_ctxt_spark_stack == NULL);
+    MR_assert(
+        MR_ENGINE(MR_eng_this_context) == NULL
+    ||
+        MR_wsdeque_is_empty(
+            &MR_ENGINE(MR_eng_this_context)->MR_ctxt_spark_deque)
+    );
 
     depth = MR_ENGINE(MR_eng_c_depth);
     thd = MR_ENGINE(MR_eng_owner_thread);
@@ -560,9 +581,8 @@
             tmp = tmp->MR_ctxt_next;
         }
 
-        /* Check if the spark queue is nonempty. */
-        spark = MR_spark_queue_head;
-        if (spark != NULL) {
+        /* Check if the global spark queue is nonempty. */
+        if (MR_wsdeque_take_top(&MR_spark_queue, &spark)) {
             MR_num_idle_engines--;
             MR_num_outstanding_contexts_and_sparks--;
             goto ReadySpark;
@@ -595,10 +615,6 @@
 
   ReadySpark:
 
-    MR_spark_queue_head = spark->MR_spark_next;
-    if (MR_spark_queue_tail == spark) {
-        MR_spark_queue_tail = NULL;
-    }
     MR_UNLOCK(&MR_runqueue_lock, "MR_do_runnext (iii)");
 
     /* Grab a new context if we haven't got one then begin execution. */
@@ -607,18 +623,13 @@
             MR_CONTEXT_SIZE_SMALL, NULL);
         MR_load_context(MR_ENGINE(MR_eng_this_context));
     }
-    MR_parent_sp = spark->MR_spark_parent_sp;
-    MR_SET_THREAD_LOCAL_MUTABLES(spark->MR_spark_thread_local_mutables);
-    MR_GOTO(spark->MR_spark_resume);
+    MR_parent_sp = spark.MR_spark_parent_sp;
+    MR_assert(MR_parent_sp != MR_sp);
+    MR_SET_THREAD_LOCAL_MUTABLES(spark.MR_spark_thread_local_mutables);
+    MR_GOTO(spark.MR_spark_resume);
 }
 #else /* !MR_THREAD_SAFE */
 {
-    /*
-    ** We don't support actually putting things in the global spark queue
-    ** in these grades.
-    */
-    assert(MR_spark_queue_head == NULL);
-
     if (MR_runqueue_head == NULL && MR_pending_contexts == NULL) {
         MR_fatal_error("empty runqueue!");
     }
Index: ./runtime/mercury_wsdeque.c
===================================================================
RCS file: ./runtime/mercury_wsdeque.c
diff -N ./runtime/mercury_wsdeque.c
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ ./runtime/mercury_wsdeque.c	4 Sep 2007 07:30:35 -0000
@@ -0,0 +1,145 @@
+/*
+** vim: ts=4 sw=4 expandtab
+*/
+/*
+** Copyright (C) 2007 The University of Melbourne.
+** This file may only be copied under the terms of the GNU Library General
+** Public License - see the file COPYING.LIB in the Mercury distribution.
+*/
+
+/*
+** mercury_wsdeque.c
+**
+** This file implements the basic algorithm from David Chase, Yossi Lev:
+** "Dynamic circular work-stealing deque".  SPAA 2005: 21-28.
+**
+** A work-stealing deque is a double ended queue in which only one thread can
+** access one end of the queue (the bottom) while other threads can only pop
+** elements from the other end (the top).
+**
+** We haven't implemented work stealing yet so the data structure is currently
+** only used as a stack (for context-local sparks) and as a queue (for the
+** global spark queue).
+**
+** NOTE: we need to insert memory barriers in the right places once we do start
+** work stealing.
+*/
+
+#include "mercury_imp.h"
+#include "mercury_memory_handlers.h"
+#include "mercury_context.h"
+#include "mercury_wsdeque.h"
+
+/*---------------------------------------------------------------------------*/
+
+#ifdef MR_LL_PARALLEL_CONJ
+
+static MR_SparkArray *
+MR_alloc_spark_array(MR_Integer size);
+
+void
+MR_init_wsdeque(MR_SparkDeque *dq, MR_Integer size)
+{
+    dq->MR_sd_bottom = 0;
+    dq->MR_sd_top = 0;
+    if (dq->MR_sd_active_array == NULL) {
+        /* The context might already have a deque if it is being recycled. */
+        dq->MR_sd_active_array = MR_alloc_spark_array(size);
+    }
+}
+
+MR_bool
+MR_wsdeque_is_empty(const MR_SparkDeque *dq)
+{
+    return dq->MR_sd_bottom == dq->MR_sd_top;
+}
+
+void
+MR_wsdeque_putback_bottom(MR_SparkDeque *dq, const MR_Spark *spark)
+{
+    MR_Integer              bot;
+    volatile MR_SparkArray  *arr;
+
+    bot = dq->MR_sd_bottom;
+    arr = dq->MR_sd_active_array;
+
+    MR_sa_element(arr, bot) = *spark;
+    dq->MR_sd_bottom = bot + 1;
+}
+
+int
+MR_wsdeque_steal_top(MR_SparkDeque *dq, MR_Spark *ret_spark)
+{
+    MR_Integer              top;
+    MR_Integer              bot;
+    volatile MR_SparkArray  *arr;
+    MR_Integer              size;
+
+    top = dq->MR_sd_top;
+    bot = dq->MR_sd_bottom;
+    arr = dq->MR_sd_active_array;
+    size = bot - top;
+
+    if (size <= 0) {
+        return 0;   /* empty */
+    }
+
+    *ret_spark = MR_sa_element(arr, top);
+    if (!MR_compare_and_swap_word(&dq->MR_sd_top, top, top + 1)) {
+        return -1;  /* abort */
+    }
+
+    return 1;       /* success */
+}
+
+int
+MR_wsdeque_take_top(MR_SparkDeque *dq, MR_Spark *ret_spark)
+{
+    MR_Integer              top;
+    MR_Integer              bot;
+    volatile MR_SparkArray  *arr;
+    MR_Integer              size;
+
+    top = dq->MR_sd_top;
+    bot = dq->MR_sd_bottom;
+    arr = dq->MR_sd_active_array;
+
+    size = bot - top;
+    if (size <= 0) {
+        return 0;   /* empty */
+    }
+
+    *ret_spark = MR_sa_element(arr, top);
+    dq->MR_sd_top = top + 1;
+    return 1;       /* success */
+}
+
+static MR_SparkArray *
+MR_alloc_spark_array(MR_Integer size)
+{
+    MR_SparkArray *arr;
+
+    arr = MR_GC_malloc(sizeof(MR_SparkArray) + (size - 1) * sizeof(MR_Spark));
+    arr->MR_sa_max = size - 1;
+    return arr;
+}
+
+MR_SparkArray *
+MR_grow_spark_array(const MR_SparkArray *old_arr, MR_Integer bot,
+        MR_Integer top)
+{
+    MR_Integer      new_size;
+    MR_SparkArray    *new_arr;
+    MR_Integer      i;
+
+    new_size = 2 * (old_arr->MR_sa_max + 1);
+    new_arr = MR_alloc_spark_array(new_size);
+
+    for (i = top; i < bot; i++) {
+        MR_sa_element(new_arr, i) = MR_sa_element(old_arr, i);
+    }
+
+    return new_arr;
+}
+
+#endif /* !MR_LL_PARALLEL_CONJ */
Index: ./runtime/mercury_mm_own_stacks.c
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/runtime/mercury_mm_own_stacks.c,v
retrieving revision 1.11
diff -u -r1.11 mercury_mm_own_stacks.c
--- ./runtime/mercury_mm_own_stacks.c	17 Apr 2007 05:38:09 -0000	1.11
+++ ./runtime/mercury_mm_own_stacks.c	4 Sep 2007 07:30:35 -0000
@@ -558,7 +558,11 @@
             generator);
         MR_copy_eng_this_context_fields(ctxt, MR_ENGINE(MR_eng_this_context));
         ctxt->MR_ctxt_next = NULL;
-        ctxt->MR_ctxt_spark_stack = NULL;
+#ifdef MR_LL_PARALLEL_CONJ
+        ctxt->MR_ctxt_spark_deque.MR_sd_active_array = NULL;
+        MR_init_wsdeque(&ctxt->MR_ctxt_spark_deque,
+            MR_INITIAL_LOCAL_SPARK_DEQUE_SIZE);
+#endif
     }
 
     ctxt->MR_ctxt_owner_generator = generator;
Index: ./runtime/mercury_wsdeque.h
===================================================================
RCS file: ./runtime/mercury_wsdeque.h
diff -N ./runtime/mercury_wsdeque.h
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ ./runtime/mercury_wsdeque.h	4 Sep 2007 07:30:35 -0000
@@ -0,0 +1,158 @@
+/*
+** vim: ts=4 sw=4 expandtab
+*/
+/*
+** Copyright (C) 2007 The University of Melbourne.
+** This file may only be copied under the terms of the GNU Library General
+** Public License - see the file COPYING.LIB in the Mercury distribution.
+*/
+
+#ifndef MERCURY_WSDEQUE_H
+#define MERCURY_WSDEQUE_H
+
+#ifdef MR_LL_PARALLEL_CONJ
+
+#include "mercury_atomic.h"
+
+/* XXX should experiment with these */
+#define MR_INITIAL_GLOBAL_SPARK_QUEUE_SIZE  4
+#define MR_INITIAL_LOCAL_SPARK_DEQUE_SIZE   8
+
+/*---------------------------------------------------------------------------*/
+
+/* See mercury_context.h for the definition of MR_SparkDeque. */
+
+struct MR_SparkArray_Struct {
+    MR_Integer          MR_sa_max;          /* power of two - 1 */
+    volatile MR_Spark   MR_sa_segment[1];   /* really MR_sa_max + 1 */
+};
+
+/*
+** MR_sa_element(Array, Pos)
+** Index into Array modulo its size, i.e. treating it as a circular array.
+**
+** MR_sa_max is a power of two - 1 so that we can use a bitwise AND operation
+** operation instead of modulo when indexing into the array, which makes a
+** significant difference.
+*/
+#define MR_sa_element(arr, pos)     (arr->MR_sa_segment[pos & arr->MR_sa_max])
+
+/*---------------------------------------------------------------------------*/
+
+/*
+** Initialise a spark deque.  A new circular array underlying the deque will
+** only be allocated if deque->MR_sd_active_array is NULL, otherwise only the
+** indices into the array will be reset.  `size' must be a power of two.
+*/
+extern  void    MR_init_wsdeque(MR_SparkDeque *dq, MR_Integer size);
+
+/*
+** Return true if the deque is empty.
+*/
+extern  MR_bool MR_wsdeque_is_empty(const MR_SparkDeque *dq);
+
+/*
+** Push a spark on the bottom of the deque.  Must only be called by the owner
+** of the deque.  The deque may grow as necessary.
+*/
+MR_EXTERN_INLINE
+void            MR_wsdeque_push_bottom(MR_SparkDeque *dq,
+                    const MR_Spark *spark);
+
+/*
+** Same as MR_wsdeque_push_bottom but assume that there is enough space
+** in the deque.  Should only be called after a successful pop.
+*/
+extern  void    MR_wsdeque_putback_bottom(MR_SparkDeque *dq,
+                    const MR_Spark *spark);
+
+/*
+** Pop a spark off the bottom of the deque.  Must only be called by
+** the owner of the deque.  Returns true if successful.
+*/
+MR_EXTERN_INLINE
+MR_bool         MR_wsdeque_pop_bottom(MR_SparkDeque *dq, MR_Spark *ret_spark);
+
+/*
+** Attempt to steal a spark from the top of the deque.
+**
+** Returns:
+**   1 on success,
+**   0 if the deque is empty or
+**  -1 if the steal was aborted due to a concurrent steal or pop_bottom.
+*/
+extern  int     MR_wsdeque_steal_top(MR_SparkDeque *dq, MR_Spark *ret_spark);
+
+/*
+** Take a spark from the top of the deque, assuming there are no concurrent
+** operations on the deque.  Returns true on success.
+*/
+extern  int     MR_wsdeque_take_top(MR_SparkDeque *dq, MR_Spark *ret_spark);
+
+/*
+** Return a new circular array with double the capacity of the old array.
+** The valid elements of the old array are copied to the new array.
+*/
+extern  MR_SparkArray * MR_grow_spark_array(const MR_SparkArray *old_arr,
+                            MR_Integer bot, MR_Integer top);
+
+/*---------------------------------------------------------------------------*/
+
+MR_EXTERN_INLINE void
+MR_wsdeque_push_bottom(MR_SparkDeque *dq, const MR_Spark *spark)
+{
+    MR_Integer              bot;
+    MR_Integer              top;
+    volatile MR_SparkArray  *arr;
+    MR_Integer              size;
+    
+    bot = dq->MR_sd_bottom;
+    top = dq->MR_sd_top;
+    arr = dq->MR_sd_active_array;
+    size = bot - top;
+
+    if (size >= arr->MR_sa_max) {
+        arr = MR_grow_spark_array((MR_SparkArray *) arr, bot, top);
+        dq->MR_sd_active_array = arr;
+    }
+
+    MR_sa_element(arr, bot) = *spark;
+    dq->MR_sd_bottom = bot + 1;
+}
+
+MR_EXTERN_INLINE MR_bool
+MR_wsdeque_pop_bottom(MR_SparkDeque *dq, MR_Spark *ret_spark)
+{
+    MR_Integer              bot;
+    MR_Integer              top;
+    MR_Integer              size;
+    volatile MR_SparkArray  *arr;
+    MR_bool                 success;
+
+    bot = dq->MR_sd_bottom;
+    arr = dq->MR_sd_active_array;
+    bot--;
+    dq->MR_sd_bottom = bot;
+
+    top = dq->MR_sd_top;
+    size = bot - top;
+
+    if (size < 0) {
+        dq->MR_sd_bottom = top;
+        return MR_FALSE;
+    }
+
+    *ret_spark = MR_sa_element(arr, bot);
+    if (size > 0) {
+        return MR_TRUE;
+    }
+
+    /* size = 0 */
+    success = MR_compare_and_swap_word(&dq->MR_sd_top, top, top + 1);
+    dq->MR_sd_bottom = top + 1;
+    return success;
+}
+
+#endif /* !MR_LL_PARALLEL_CONJ */
+
+#endif /* !MERCURY_WSDEQUE_H */
Index: ./runtime/mercury_engine.h
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/runtime/mercury_engine.h,v
retrieving revision 1.48
diff -u -r1.48 mercury_engine.h
--- ./runtime/mercury_engine.h	10 May 2007 05:24:16 -0000	1.48
+++ ./runtime/mercury_engine.h	4 Sep 2007 07:30:35 -0000
@@ -373,7 +373,7 @@
     MR_Word             *MR_eng_sol_hp;
     MR_Word             *MR_eng_global_hp;
 #endif
-#ifdef  MR_THREAD_SAFE
+#ifdef  MR_LL_PARALLEL_CONJ
     MR_Word             *MR_eng_parent_sp;
 #endif
     MR_Context          *MR_eng_this_context;
Index: ./runtime/mercury_conf_param.h
===================================================================
RCS file: /home/mercury/mercury1/repository/mercury/runtime/mercury_conf_param.h,v
retrieving revision 1.98
diff -u -r1.98 mercury_conf_param.h
--- ./runtime/mercury_conf_param.h	31 Jul 2007 07:58:43 -0000	1.98
+++ ./runtime/mercury_conf_param.h	4 Sep 2007 07:30:35 -0000
@@ -105,7 +105,7 @@
 **	use inline functions rather than macros for a few builtins.
 **
 ** MR_THREAD_SAFE
-**	Enable support for parallelism [not yet working].
+**	Enable support for parallelism.
 **
 ** MR_NO_BACKWARDS_COMPAT
 **	Disable backwards compatibility with C code using obsolete low-level
Index: ./runtime/mercury_atomic.h
===================================================================
RCS file: ./runtime/mercury_atomic.h
diff -N ./runtime/mercury_atomic.h
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ ./runtime/mercury_atomic.h	4 Sep 2007 07:30:35 -0000
@@ -0,0 +1,80 @@
+/*
+** vim:ts=4 sw=4 expandtab
+*/
+/*
+** Copyright (C) 2007 The University of Melbourne.
+** This file may only be copied under the terms of the GNU Library General
+** Public License - see the file COPYING.LIB in the Mercury distribution.
+*/
+
+/*
+** mercury_atomic.h - defines atomic operations.
+*/
+
+#ifndef MERCURY_ATOMIC_H
+#define MERCURY_ATOMIC_H
+
+/*
+** If the value at addr is equal to old, assign new to addr and return true.
+** Otherwise return false.
+*/
+MR_EXTERN_INLINE MR_bool
+MR_compare_and_swap_word(volatile MR_Integer *addr, MR_Integer old,
+        MR_Integer new_val) ;
+
+/*---------------------------------------------------------------------------*/
+
+#if defined(__GNUC__) && defined(__x86_64__)
+
+    MR_EXTERN_INLINE MR_bool
+    MR_compare_and_swap_word(volatile MR_Integer *addr, MR_Integer old,
+            MR_Integer new_val) 
+    {
+        char result;
+
+        __asm__ __volatile__(
+            "lock; cmpxchgq %3, %0; setz %1"
+            : "=m"(*addr), "=q"(result)
+            : "m"(*addr), "r" (new_val), "a"(old)
+            : "memory"
+        );
+        return (int) result;
+    }
+
+#elif defined(__GNUC__) && defined(__i386__)
+
+    /* Really 486 or better. */
+    MR_EXTERN_INLINE MR_bool
+    MR_compare_and_swap_word(volatile MR_Integer *addr, MR_Integer old,
+            MR_Integer new_val) 
+    {
+        char result;
+
+        __asm__ __volatile__(
+            "lock; cmpxchgl %3, %0; setz %1"
+            : "=m"(*addr), "=q"(result)
+            : "m"(*addr), "r" (new_val), "a"(old)
+            : "memory");
+        return (int) result;
+    }
+
+#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
+
+    /* gcc 4.1 and above have builtin atomic operations. */
+
+    MR_EXTERN_INLINE MR_bool
+    MR_compare_and_swap_word(volatile MR_Integer *addr, MR_Integer old,
+            MR_Integer new_val) 
+    {
+        return __sync_bool_compare_and_swap(addr, old, new_val);
+    }
+
+#endif
+
+/*
+** If we don't have definitions available for this compiler or architecture
+** then we will get a link error in low-level .par grades.  No other grades
+** currently require on atomic ops.
+*/
+
+#endif /* not MERCURY_ATOMIC_H */

--------------------------------------------------------------------------
mercury-reviews mailing list
Post messages to:       mercury-reviews at csse.unimelb.edu.au
Administrative Queries: owner-mercury-reviews at csse.unimelb.edu.au
Subscriptions:          mercury-reviews-request at csse.unimelb.edu.au
--------------------------------------------------------------------------



More information about the reviews mailing list