[m-rev.] For post-commit review: Automatic Parallelisation Improvements.

Paul Bone pbone at csse.unimelb.edu.au
Wed Jul 14 10:42:28 AEST 2010


For post-commit review by Zoltan.

---

Automatic Parallelisation Improvements.

Factor in all the costs of parallelistion into the parallel overlap estimation
algorithm.  Previously only some costs where being taken into consideration.
Independent parallelsations are now generally preferred as they have fewer
overheads for similar parallelsations.

Generalised the branch and bound search algorithm into a new Mercury module.

mdbcomp/feedback.m:
    Grouped candidate parallel conjunction parameters into a single type.

    Added extra parameters:
        future_signal_cost
        future_wait_cost
        context_wakeup_delay.
    The first two replace locking cost, they are the costs of the signal and
    wait calls for futures respectively.  The third represents the length of
    time for a context to begin executing after it has been placed on the run
    queue.  It is used to estimate the cost of blocking.

    Refactored the parallel_exec_metrics type to make representing overheads easier.

    Modify parallel_exec_metrics so that it can represent the cost of calling
    signal in the left conjunct of any conjunct pair.

    Modify parallel_exec_metrics so that it stores the parallel execution time
    of the initial (leftmost) conjunct.  This is necessary as the parallel
    execution time includes the cost of the 'fork' call of the next conjunct.

    Modify parallel_exec_metrics to record the cost of blocking for the
    leftmost conjunct if it completes before the parallel conjunction completes
    as a whole.

    Increment the feedback file format version number.

compiler/implicit_parallelism.m:
    Conform to changes in mdbcomp/feedback.m.

deep_profiler/branch_and_bound.m:
    A generic branch and bound solver loop and utilities.

    The modified branch and bound code includes a profiling facility.

deep_profiler/Mercury.options:
    The new branch_and_bound module supports the debug_branch_and_bound trace
    flag. 

deep_profiler/mdprof_fb.automatic_parallelism.m:
    Generalise and move branch and bound code to branch_and_bound.m

    Removed the candidate_parallel_conjunctions_opts type, we now use the
    candidate_par_conjunctions_params type in its place.

    Modify the code for parallelising conjunctions so that it works with lists
    of goals rather than cords of goals.

    Factor out the code tha looks for the next costly call, this is now handled
    by a preprocessing pass so that it has linear time rather than increasing
    the complexity of the search code.

    Documented some predicates in more detail.

deep_profiler/mdprof_feedback.m:
    Conform to changes in deep_profiler/mdprof_fb.automatic_parallelism.m and
    mdbcomp/feedback.m

    Add command line support for the new candidate parallel conjunctions
    feedback parameters.

Index: compiler/implicit_parallelism.m
===================================================================
RCS file: /home/mercury1/repository/mercury/compiler/implicit_parallelism.m,v
retrieving revision 1.18
diff -u -p -b -r1.18 implicit_parallelism.m
--- compiler/implicit_parallelism.m	4 Jul 2010 10:24:08 -0000	1.18
+++ compiler/implicit_parallelism.m	14 Jul 2010 00:29:54 -0000
@@ -160,19 +160,7 @@ apply_new_implicit_parallelism_transform
     %
 :- type parallelism_info
     --->    parallelism_info(
-                pi_desired_parallelism  :: float,
-                    % The number of desired busy sparks.
-
-                pi_sparking_cost        :: int,
-                    % The cost of creating a spark in call sequence counts.
-
-                pi_sparking_delay       :: int,
-                    % The time it takes for a spark to be created, stolen and
-                    % begin execution.
-
-                pi_locking_cost         :: int,
-                    % The cost of maintaining a lock on a single dependant
-                    % variable in call sequence counts.
+                pi_parameters           :: candidate_par_conjunctions_params,
 
                 pi_cpc_map              :: module_candidate_par_conjs_map
                     % A map of candidate parallel conjunctions in this module
@@ -202,15 +190,13 @@ apply_new_implicit_parallelism_transform
 
 get_implicit_parallelism_feedback(ModuleName, FeedbackInfo, ParallelismInfo) :-
     FeedbackData = 
-        feedback_data_candidate_parallel_conjunctions(_, _, _, _, _),
+        feedback_data_candidate_parallel_conjunctions(_, _),
     get_feedback_data(FeedbackInfo, FeedbackData),
-    FeedbackData = feedback_data_candidate_parallel_conjunctions(
-        DesiredParallelism, SparkingCost, SparkingDelay, LockingCost,
-        AssocList), 
+    FeedbackData = 
+        feedback_data_candidate_parallel_conjunctions(Parameters, AssocList),
     make_module_candidate_par_conjs_map(ModuleName, AssocList,
         CandidateParConjsMap),
-    ParallelismInfo = parallelism_info(DesiredParallelism, SparkingCost,
-        SparkingDelay, LockingCost, CandidateParConjsMap).
+    ParallelismInfo = parallelism_info(Parameters, CandidateParConjsMap).
 
 :- pred make_module_candidate_par_conjs_map(module_name::in,
     assoc_list(string_proc_label, candidate_par_conjunctions_proc)::in,
Index: deep_profiler/Mercury.options
===================================================================
RCS file: /home/mercury1/repository/mercury/deep_profiler/Mercury.options,v
retrieving revision 1.14
diff -u -p -b -r1.14 Mercury.options
--- deep_profiler/Mercury.options	8 Jun 2010 13:00:51 -0000	1.14
+++ deep_profiler/Mercury.options	14 Jul 2010 00:29:54 -0000
@@ -37,7 +37,10 @@ MCFLAGS-startup = 	--no-optimize-duplica
 MCFLAGS-top_procs = 	--no-optimize-duplicate-calls
 
 # Uncomment this to debug the automatic parallelism code.
+#MCFLAGS-branch_and_bound = \
+#	--trace-flag=debug_branch_and_bound
 #MCFLAGS-mdprof_fb.automatic_parallelism = \
+#	--trace-flag=debug_branch_and_bound \
 #	--trace-flag=debug_cpc_search \
 #	--trace-flag=debug_recursive_costs \
 #	--trace-flag=debug_parallel_conjunction_speedup \
Index: deep_profiler/branch_and_bound.m
===================================================================
RCS file: deep_profiler/branch_and_bound.m
diff -N deep_profiler/branch_and_bound.m
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ deep_profiler/branch_and_bound.m	14 Jul 2010 00:29:54 -0000
@@ -0,0 +1,288 @@
+%-----------------------------------------------------------------------------%
+% vim: ft=mercury ts=4 sw=4 et
+%-----------------------------------------------------------------------------%
+% Copyright (C) 2010 The University of Melbourne.
+% This file may only be copied under the terms of the GNU General
+% Public License - see the file COPYING in the Mercury distribution.
+%-----------------------------------------------------------------------------%
+%
+% File: branch_and_bound.m.
+% Author: pbone.
+%
+% This module contains a generic branch and bound solver.  It is designed to be
+% generic and easy to use.  To use it write non-deterministic search code that
+% uses test_incomplete_solution after every choice point.  Call this code using
+% branch_and_bound. 
+%
+% This module may be compiled with the debug_branch_and_bound trace flag to
+% enable the debugging trace goals.
+%
+%-----------------------------------------------------------------------------%
+
+:- module branch_and_bound.
+:- interface.
+
+:- import_module int.
+:- import_module float.
+:- import_module set.
+
+%-----------------------------------------------------------------------------%
+
+    % The state of the branch and bound solver.  This structure contains
+    % mutable that are not reverted during backtracking.  This is how the
+    % solver maintains the current best value for the objective function.
+    % This means that only one value of this type needs to be passed into the
+    % search code.
+    %
+:- type bnb_state(T).
+
+    % Profiling information for an execution of the solver.
+    %
+:- type bnb_profile
+    --->    bnb_profile(
+                bnbp_tests_succeeded        :: int,
+                bnbp_tests_failed           :: int,
+                bnbp_new_best_solution      :: int,
+                bnbp_new_equal_solution     :: int,
+                bnbp_not_best_solution      :: int,
+                bnbp_time_msecs             :: int
+            ).
+
+    % branch_and_bound(GenerateSolutions, ObjectiveFn, BestSolutions).
+    %
+    % Use a branch and bound search to return the set of BestSolutions
+    % according to the ObjectiveFn that GenerateSolutions can generate.
+    %
+    % Note that more optimal solutions return _smaller_ values from ObjectiveFn.
+    %
+    % The set of best solutions is returned, it is up to the caller to break
+    % ties if necessary.
+    %
+:- pred branch_and_bound(semipure pred(bnb_state(T), T),
+    func(T) = float, set(T), bnb_profile).
+:- mode branch_and_bound(pred(in, out) is nondet, 
+    func(in) = out is det, out, out) is det.
+
+    % test_incomplete_solution(State, PartialSolution).
+    %
+    % This is true if PartialSolution is not worse than the current best
+    % solution.  It does not update the best solution.  Programmers should use
+    % this after every choice point to avoid searching non-optimistic parts of
+    % their search tree.
+    %
+:- semipure pred test_incomplete_solution(bnb_state(T)::in, T::in) is semidet.
+
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+
+:- implementation.
+
+:- import_module benchmarking.
+:- import_module io.
+:- import_module list.
+:- import_module mutvar.
+:- import_module pair.
+:- import_module require.
+:- import_module string.
+:- import_module unit.
+
+%-----------------------------------------------------------------------------%
+    
+    % The best solutions found so far and the value of the objective
+    % function for these solutions.
+    %
+:- type best_solutions(T)
+    --->    no_best_solutions
+    ;       best_solutions(
+                bs_solutions            :: list(T),
+                bs_objective_value      :: float
+                    % Note that the solver tries to minimise this value.  That
+                    % is smaller numbers are more optimal.
+            ).
+
+%-----------------------------------------------------------------------------%
+
+:- type bnb_state(T)
+    --->    bnb_state(
+                best_solutions_mutable      :: mutvar(best_solutions(T)),
+                objective_function          :: func(T) = float,
+                profile                     :: mutvar(bnb_profile)
+            ).
+
+:- inst bnb_state
+    --->    bnb_state(ground, func(in) = out is det, ground).
+
+%-----------------------------------------------------------------------------%
+
+branch_and_bound(Generator, ObjectiveFn, BestSolutions, Profile) :-
+    promise_equivalent_solutions [Time, BestSolutions, Profile0] (
+        benchmark_det(branch_and_bound_2(Generator, ObjectiveFn), unit,
+            (BestSolutions - Profile0), 1, Time)
+    ),
+    Profile = Profile0 ^ bnbp_time_msecs := Time.
+
+:- pred branch_and_bound_2(semipure pred(bnb_state(T), T), 
+    func(T) = float, unit, 
+    pair(set(T), bnb_profile)).
+:- mode branch_and_bound_2(pred(in, out) is nondet, 
+    func(in) = out is det, in, out) is det.
+
+branch_and_bound_2(Generator, ObjectiveFn, unit, FinalBestSolutions - FinalProfile) :-
+    % Use a failure driven loop to implement a branch and bound solver.
+    promise_pure (
+        trace [compile_time(flag("debug_branch_and_bound")), io(!IO)] (
+            io.write_string("D: Branch and bound loop starting\n", !IO)
+        ),
+        impure new_mutvar(no_best_solutions, BestSolutionsMutvar),
+        impure new_mutvar(new_bnb_profile, ProfileMutvar),
+        State = bnb_state(BestSolutionsMutvar, ObjectiveFn, ProfileMutvar),
+        (
+            semipure Generator(State, CurrentSolution),
+
+            impure test_complete_solution(State, CurrentSolution), 
+            trace [compile_time(flag("debug_branch_and_bound")), io(!IO)] (
+                CurrentObjective = ObjectiveFn(CurrentSolution),
+                io.format(
+                    "D: Branch and bound: Solution found with objective: %f\n",
+                    [f(CurrentObjective)], !IO)
+            ),
+
+            semidet_fail
+        ->
+            error(this_file ++ "Failure driven loop must fail")
+        ;
+            true
+        ),
+
+        % Return results.
+        impure get_mutvar(BestSolutionsMutvar, FinalBestSolutions0), 
+        impure get_mutvar(ProfileMutvar, FinalProfile),
+        trace [compile_time(flag("debug_branch_and_bound")), io(!IO)] (
+            io.write_string("D: Branch and bound loop finished\n", !IO)
+        ),
+        (
+            FinalBestSolutions0 = no_best_solutions,
+            FinalBestSolutions = set.init
+        ;
+            FinalBestSolutions0 = best_solutions(Solutions, _),
+            FinalBestSolutions = set.from_list(Solutions)
+        )
+    ).
+
+%-----------------------------------------------------------------------------%
+
+    % test_complete_solution(State, Solution).
+    %
+    % True of Solution is the best or equal best solution so far.
+    %
+    % The current best solutions are updated.
+    % 
+:- impure pred test_complete_solution(bnb_state(T)::in, T::in) is semidet.
+
+test_complete_solution(State, CurrentSolution) :- 
+    State = bnb_state(BestSolutionsMutvar, ObjectiveFn, ProfileMutvar),
+    CurrentObjective = ObjectiveFn(CurrentSolution),
+   
+    impure get_mutvar(BestSolutionsMutvar, BestSolutions0),
+    impure get_mutvar(ProfileMutvar, Profile0),
+    (
+        (
+            BestSolutions0 = no_best_solutions,
+            BestSolutions = best_solutions([CurrentSolution],
+                CurrentObjective),
+            profile_new_best_solution(Profile0, Profile)
+        ;
+            BestSolutions0 = best_solutions(Solutions, BestObjective),
+            ( CurrentObjective < BestObjective ->
+                BestSolutions = best_solutions([CurrentSolution],
+                    CurrentObjective),
+                profile_new_best_solution(Profile0, Profile)
+            ; CurrentObjective = BestObjective ->
+                BestSolutions = best_solutions(
+                    [CurrentSolution | Solutions], BestObjective),
+                profile_equal_best_solution(Profile0, Profile)
+            ;
+                fail
+            )
+        )
+    ->
+        % If this solution is best or equal best.
+        impure set_mutvar(BestSolutionsMutvar, BestSolutions),
+        impure set_mutvar(ProfileMutvar, Profile)
+    ;
+        % If this solution is not better.
+        profile_not_best_solution(Profile0, Profile),
+        impure set_mutvar(ProfileMutvar, Profile),
+        semidet_fail
+    ).
+
+%-----------------------------------------------------------------------------%
+
+test_incomplete_solution(State, Solution) :-
+    State = bnb_state(BestSolutionsMutvar, ObjectiveFn, ProfileMutvar),
+    promise_semipure (
+        impure get_mutvar(BestSolutionsMutvar, BestSolutions),
+        impure get_mutvar(ProfileMutvar, Profile0),
+        (
+            (
+                BestSolutions = no_best_solutions
+            ;
+                BestSolutions = best_solutions(_, BestObjective),
+                CurrentObjective = ObjectiveFn(Solution),
+                CurrentObjective =< BestObjective
+            )
+        ->
+            profile_test_succeeded(Profile0, Profile),
+            impure set_mutvar(ProfileMutvar, Profile)
+        ;
+            profile_test_failed(Profile0, Profile),
+            impure set_mutvar(ProfileMutvar, Profile),
+            semidet_fail
+        )
+    ).
+
+%-----------------------------------------------------------------------------%
+
+:- func new_bnb_profile = bnb_profile.
+
+new_bnb_profile = bnb_profile(0, 0, 0, 0, 0, 0).
+
+:- pred profile_new_best_solution(bnb_profile::in, bnb_profile::out) is det.
+
+profile_new_best_solution(Profile0, Profile) :-
+    Profile = Profile0 ^ bnbp_new_best_solution :=
+        Profile0 ^ bnbp_new_best_solution + 1.
+
+:- pred profile_equal_best_solution(bnb_profile::in, bnb_profile::out) is det.
+
+profile_equal_best_solution(Profile0, Profile) :-
+    Profile = Profile0 ^ bnbp_new_equal_solution :=
+        Profile0 ^ bnbp_new_equal_solution + 1.
+
+:- pred profile_not_best_solution(bnb_profile::in, bnb_profile::out) is det.
+
+profile_not_best_solution(Profile0, Profile) :-
+    Profile = Profile0 ^ bnbp_not_best_solution :=
+        Profile0 ^ bnbp_not_best_solution + 1.
+
+:- pred profile_test_succeeded(bnb_profile::in, bnb_profile::out) is det.
+
+profile_test_succeeded(Profile0, Profile) :-
+    Profile = Profile0 ^ bnbp_tests_succeeded :=
+        Profile0 ^ bnbp_tests_succeeded + 1.
+
+:- pred profile_test_failed(bnb_profile::in, bnb_profile::out) is det.
+
+profile_test_failed(Profile0, Profile) :-
+    Profile = Profile0 ^ bnbp_tests_failed :=
+        Profile0 ^ bnbp_tests_failed + 1.
+
+%-----------------------------------------------------------------------------%
+
+:- func this_file = string.
+
+this_file = "branch_and_bound.m: ".
+
+%-----------------------------------------------------------------------------%
+:- end_module branch_and_bound.
+%-----------------------------------------------------------------------------%
Index: deep_profiler/mdprof_fb.automatic_parallelism.m
===================================================================
RCS file: /home/mercury1/repository/mercury/deep_profiler/mdprof_fb.automatic_parallelism.m,v
retrieving revision 1.9
diff -u -p -b -r1.9 mdprof_fb.automatic_parallelism.m
--- deep_profiler/mdprof_fb.automatic_parallelism.m	4 Jul 2010 10:24:09 -0000	1.9
+++ deep_profiler/mdprof_fb.automatic_parallelism.m	14 Jul 2010 00:35:49 -0000
@@ -26,7 +26,6 @@
 
 :- import_module cord.
 :- import_module int.
-:- import_module float.
 :- import_module pair.
 
 %-----------------------------------------------------------------------------%
@@ -35,30 +34,9 @@
     % implicit parallelism.
     %
 :- pred candidate_parallel_conjunctions(
-    candidate_parallel_conjunctions_opts::in, deep::in, cord(message)::out,
+    candidate_par_conjunctions_params::in, deep::in, cord(message)::out,
     feedback_info::in, feedback_info::out) is det.
 
-:- type candidate_parallel_conjunctions_opts
-    --->    candidate_parallel_conjunctions_opts(
-                cpc_desired_parallelism     :: float,
-                cpc_sparking_cost           :: int,
-                    % The cost of calling MR_fork_new_child.
-
-                cpc_sparking_delay          :: int,
-                    % The time from the start of the call to MR_fork_new_child
-                    % until the spark has been stolen (and in some cases a
-                    % context has been created).
-
-                cpc_locking_cost            :: int,
-                cpc_clique_threshold        :: int,
-                cpc_call_site_threshold     :: int,
-                cpc_parallelise_dep_conjs   :: parallelise_dep_conjs 
-            ).
-    
-:- type parallelise_dep_conjs
-    --->    parallelise_dep_conjs
-    ;       do_not_parallelise_dep_conjs.
-
 %-----------------------------------------------------------------------------%
     
     % Perform Jerome's analysis and update the feedback info structure.
@@ -83,6 +61,7 @@
 
 :- implementation.
 
+:- import_module branch_and_bound.
 :- import_module coverage.
 :- import_module create_report.
 :- import_module measurement_units.
@@ -99,7 +78,6 @@
 :- import_module map.
 :- import_module maybe.
 :- import_module multi_map.
-:- import_module mutvar.
 :- import_module pqueue.
 :- import_module require.
 :- import_module set.
@@ -123,11 +101,7 @@
 %     Debug the branch and bound search for the best parallelisation.
 %
 
-candidate_parallel_conjunctions(Opts, Deep, Messages, !Feedback) :-
-    Opts = candidate_parallel_conjunctions_opts(DesiredParallelism,
-        SparkingCost, SparkingDelay, LockingCost, _CliqueThreshold,
-        _CallSiteThreshold, _ParalleliseDepConjs),
-
+candidate_parallel_conjunctions(Params, Deep, Messages, !Feedback) :-
     % Find opertunities for parallelism by walking the clique tree.  Don't
     % Descened into cliques cheaper than the threshold.
     deep_lookup_clique_index(Deep, Deep ^ root, RootCliquePtr),
@@ -137,7 +111,7 @@ candidate_parallel_conjunctions(Opts, De
     % exist.
     RootCliqueCost = build_cs_cost_csq(1, float(TotalCallseqs) + 1.0),
     RootParallelism = no_parallelism,
-    candidate_parallel_conjunctions_clique(Opts, Deep, RootCliqueCost,
+    candidate_parallel_conjunctions_clique(Params, Deep, RootCliqueCost,
         RootParallelism, RootCliquePtr, ConjunctionsMap, Messages),
 
     map.to_assoc_list(ConjunctionsMap, ConjunctionsAssocList0),
@@ -145,8 +119,8 @@ candidate_parallel_conjunctions(Opts, De
             pard_goal_detail_to_pard_goal),
         ConjunctionsAssocList0, ConjunctionsAssocList),
     CandidateParallelConjunctions =
-        feedback_data_candidate_parallel_conjunctions(DesiredParallelism,
-        SparkingCost, SparkingDelay, LockingCost, ConjunctionsAssocList),
+        feedback_data_candidate_parallel_conjunctions(Params,
+            ConjunctionsAssocList),
     put_feedback_data(CandidateParallelConjunctions, !Feedback).
 
 :- pred pard_goal_detail_to_pard_goal(pard_goal_detail::in, pard_goal::out) 
@@ -178,7 +152,7 @@ pard_goal_detail_annon_to_pard_goal_anno
     --->    implicit_parallelism_info(
                 ipi_deep            :: deep,
                 ipi_progrep         :: prog_rep,
-                ipi_opts            :: candidate_parallel_conjunctions_opts,
+                ipi_opts            :: candidate_par_conjunctions_params,
                 ipi_call_sites      :: map(goal_path, clique_call_site_report),
                 ipi_rec_call_sites  :: map(goal_path, cs_cost_csq),
                 ipi_var_table       :: var_table,
@@ -266,7 +240,7 @@ pard_goal_detail_annon_to_pard_goal_anno
     % than the desired amount of parallelism.
     %
 :- pred candidate_parallel_conjunctions_clique(
-    candidate_parallel_conjunctions_opts::in, deep::in, cs_cost_csq::in,
+    candidate_par_conjunctions_params::in, deep::in, cs_cost_csq::in,
     parallelism_amount::in, clique_ptr::in, candidate_par_conjunctions::out,
     cord(message)::out) is det.
 
@@ -364,9 +338,9 @@ make_clique_proc_map(CliqueProcs, Clique
     % CliquePtr is the clique that this proc belongs to.
     %
 :- pred candidate_parallel_conjunctions_clique_proc(
-    candidate_parallel_conjunctions_opts::in, deep::in, 
-    clique_is_recursive::in, cs_cost_csq::in, parallelism_amount::in,
-    set(proc_desc)::in, map(proc_desc, clique_proc_report)::in, clique_ptr::in,
+    candidate_par_conjunctions_params::in, deep::in, clique_is_recursive::in,
+    cs_cost_csq::in, parallelism_amount::in, set(proc_desc)::in, 
+    map(proc_desc, clique_proc_report)::in, clique_ptr::in,
     clique_proc_report::in, candidate_par_conjunctions::out,
     cord(message)::out) is det.
 
@@ -453,7 +427,7 @@ merge_candidate_par_conjs_proc(A, B, Res
     ).
 
 :- pred candidate_parallel_conjunctions_call_site(
-    candidate_parallel_conjunctions_opts::in, deep::in, set(proc_desc)::in,
+    candidate_par_conjunctions_params::in, deep::in, set(proc_desc)::in,
     clique_is_recursive::in, map(goal_path, cs_cost_csq)::in,
     map(proc_desc, clique_proc_report)::in, clique_ptr::in,
     parallelism_amount::in,
@@ -542,7 +516,7 @@ parallel_amount(SeqConjs, Conj, Parallel
     sub_computation_parallelism(Parallelism0, certain, Parallelism).
 
 :- pred candidate_parallel_conjunctions_callee(
-    candidate_parallel_conjunctions_opts::in, deep::in, set(proc_desc)::in,
+    candidate_par_conjunctions_params::in, deep::in, set(proc_desc)::in,
     clique_is_recursive::in, map(goal_path, cs_cost_csq)::in,
     map(proc_desc, clique_proc_report)::in, clique_ptr::in, call_site_desc::in,
     parallelism_amount::in, perf_row_data(clique_desc)::in,
@@ -585,7 +559,7 @@ candidate_parallel_conjunctions_callee(O
             % absolute way then do not recurse further.  This test is performed
             % here rather than in the callees of this predicate to avoid
             % duplication of code.
-            not cs_cost_get_total(CSCost) > float(Opts ^ cpc_clique_threshold)
+            not cs_cost_get_total(CSCost) > float(Opts ^ cpcp_clique_threshold)
         ->
             Candidates = map.init, 
             Messages = cord.empty,
@@ -601,7 +575,7 @@ candidate_parallel_conjunctions_callee(O
             % Also check check if the desired amount of parallelism has been
             % reached, if so do not recurse further to prevent creating too
             % much extra parallelism.
-            exceeded_desired_parallelism(Opts ^ cpc_desired_parallelism,
+            exceeded_desired_parallelism(Opts ^ cpcp_desired_parallelism,
                 Parallelism)
         ->
             Candidates = map.init, 
@@ -1094,9 +1068,8 @@ build_recursive_call_site_cost_map_call_
     % Find candidate parallel conjunctions within the given procedure.
     %
 :- pred candidate_parallel_conjunctions_proc(
-    candidate_parallel_conjunctions_opts::in, deep::in,
-    clique_proc_report::in, map(goal_path, cs_cost_csq)::in,
-    candidate_par_conjunctions::out,
+    candidate_par_conjunctions_params::in, deep::in, clique_proc_report::in,
+    map(goal_path, cs_cost_csq)::in, candidate_par_conjunctions::out,
     multi_map(goal_path_string, 
         candidate_par_conjunction(pard_goal_detail))::out,
     cord(message)::out) is det.
@@ -1346,14 +1319,13 @@ conj_build_candidate_conjunctions(Info, 
             append_message(Location,
                 info_found_conjs_above_callsite_threshold(NumCostlyCalls),
                 !Messages), 
-            build_dependency_maps(PardGoals, DependencyMaps),
             % We don't parallelise across non-atomic goals, so split a list of
             % pard goals into partitions where non-atomic goals seperate the
             % partitions.
             partition_pard_goals(Location, PardGoals, [], _, 
                 1, _NumPartitions, 0, _, [], PartitionedGoals, !Messages),
-            map(pardgoals_build_candidate_conjunction(Info, 
-                    DependencyMaps, Location, GoalPath), 
+            map(pardgoals_build_candidate_conjunction(Info, Location,
+                    GoalPath), 
                 PartitionedGoals, MaybeCandidates),
             filter_map(maybe_is_yes, MaybeCandidates, Candidates),
             append_message(Location,
@@ -1368,19 +1340,14 @@ conj_build_candidate_conjunctions(Info, 
 :- pred count_costly_calls(pard_goal_detail::in, int::in, int::out) is det.
 
 count_costly_calls(Goal, !NumCostlyCalls) :-
-    GoalType = Goal ^ goal_annotation ^ pgd_pg_type,
-    (
-        GoalType = pgt_call(_, CostAboveThreshold, _, _),
+    identify_costly_call(Goal, Costly),
         (
-            CostAboveThreshold = cost_above_par_threshold,
+        Costly = is_costly_goal,
             !:NumCostlyCalls = !.NumCostlyCalls + 1
         ;
-            CostAboveThreshold = cost_not_above_par_threshold
-        )
-    ;
-        GoalType = pgt_other_atomic_goal
+        Costly = is_not_costly_goal
     ;
-        GoalType = pgt_non_atomic_goal
+        Costly = is_non_atomic_goal
     ).
 
 :- pred partition_pard_goals(program_location::in, 
@@ -1441,12 +1408,11 @@ partition_pard_goals(Location, [ PG | PG
         !NumCostlyCalls, !Partitions, !Messages).
 
 :- pred pardgoals_build_candidate_conjunction(implicit_parallelism_info::in,
-    dependency_maps::in, program_location::in, goal_path::in,
-    pard_goals_partition::in,
+    program_location::in, goal_path::in, pard_goals_partition::in,
     maybe(candidate_par_conjunction(pard_goal_detail))::out) is det.
 
-pardgoals_build_candidate_conjunction(Info, DependencyMaps, Location,
-        GoalPath, GoalsPartition, MaybeCandidate) :-
+pardgoals_build_candidate_conjunction(Info, Location, GoalPath, GoalsPartition,
+        MaybeCandidate) :-
     % Setting up the first parallel conjunct is a different algorithm to the
     % latter ones, at this point we have the option of moving goals from before
     % the first costly call to either before or during the parallel
@@ -1454,21 +1420,16 @@ pardgoals_build_candidate_conjunction(In
     % efficient.  However if goals within other parallel conjuncts depend on
     % them and don't depend upon the first costly call then this would make the
     % conjunction dependant when it could be independent.
-    pard_goals_partition(GoalsList, PartNum) = GoalsPartition,
-    Goals = cord.from_list(GoalsList),
-    find_best_parallelisation(Info, Location, PartNum, DependencyMaps,
-        Goals, Parallelisation),
+    pard_goals_partition(Goals, PartNum) = GoalsPartition,
+    find_best_parallelisation(Info, Location, PartNum, Goals,
+        MaybeParallelisation),
     (
-        Parallelisation = bp_no_best_parallelisation,
+        MaybeParallelisation = no,
         MaybeCandidate = no
     ;
-        Parallelisation = bp_parallel_execution(GoalsBeforeCord, ParConjsCord,
-            GoalsAfterCord, IsDependent, Metrics),
+        MaybeParallelisation = yes(bp_parallel_execution(GoalsBefore, ParConjs,
+            GoalsAfter, IsDependent, Metrics)),
         Speedup = parallel_exec_metrics_get_speedup(Metrics),
-        GoalsBefore = list(GoalsBeforeCord),
-        GoalsAfter = list(GoalsAfterCord),
-        ParConjs = map((func(CordI) = seq_conj(list(CordI))),
-            list(ParConjsCord)),
         Candidate = candidate_par_conjunction(goal_path_to_string(GoalPath),
             PartNum, IsDependent, GoalsBefore, ParConjs, GoalsAfter, Metrics),
         ( Speedup > 1.0 ->
@@ -1502,17 +1463,18 @@ pardgoals_build_candidate_conjunction(In
 
 :- type find_costly_call_result
     --->    costly_call_found(
-                fccrfc_goals_before     :: cord(pard_goal_detail),
+                fccrfc_goals_before     :: list(pard_goal_detail),
                 fccrfc_call             :: pard_goal_detail,
-                fccrfc_gaols_after      :: cord(pard_goal_detail)
+                fccrfc_gaols_after      :: list(pard_goal_detail)
             )
     ;       costly_call_not_found.
 
-:- pred find_costly_call(cord(pard_goal_detail)::in, cord(pard_goal_detail)::in,
+:- pred find_costly_call(list(pard_goal_detail)::in, list(pard_goal_detail)::in,
     find_costly_call_result::out) is det.
 
 find_costly_call(Goals, GoalsBefore0, Result) :-
-    ( head_tail(Goals, Goal, GoalsTail) ->
+    (
+        Goals = [Goal | GoalsTail],
         GoalType = Goal ^ goal_annotation ^ pgd_pg_type,
         (
             (
@@ -1526,7 +1488,7 @@ find_costly_call(Goals, GoalsBefore0, Re
                 Result = costly_call_found(GoalsBefore0, Goal, GoalsTail)
             ;
                 CostAboveThreshold = cost_not_above_par_threshold,
-                GoalsBefore = snoc(GoalsBefore0, Goal),
+                GoalsBefore = GoalsBefore0 ++ [Goal],
                 find_costly_call(GoalsTail, GoalsBefore, Result)
             )
         ;
@@ -1534,193 +1496,298 @@ find_costly_call(Goals, GoalsBefore0, Re
             error(this_file ++ "Found non-atomic goal")
         )
     ;
+        Goals = [],
         Result = costly_call_not_found
     ).
 
 :- type best_parallelisation
     --->    bp_parallel_execution(
-                bp_goals_before         :: cord(pard_goal_detail),
-                bp_par_conjs            :: cord(cord(pard_goal_detail)),
-                bp_goals_after          :: cord(pard_goal_detail),
+                bp_goals_before         :: list(pard_goal_detail),
+                bp_par_conjs            :: list(seq_conj(pard_goal_detail)),
+                bp_goals_after          :: list(pard_goal_detail),
                 bp_is_dependent         :: conjuncts_are_dependant,
                 bp_par_exec_metrics     :: parallel_exec_metrics
-            )
-                % Rather than using the sequential execution as an initial 
-                % value for the branch and bound search we use this value.
-                % This allows us to report the best parallelisation even when
-                % sequential execution is optimal.
-    ;       bp_no_best_parallelisation.
+            ).
 
 :- type incomplete_parallelisation
     --->    incomplete_parallelisation(
-                ip_goals_before         :: cord(pard_goal_detail),
-                ip_par_conjs            :: cord(cord(pard_goal_detail)),
+                ip_goals_before         :: list(pard_goal_detail),
+                ip_par_conjs            :: list(seq_conj(pard_goal_detail)),
                 ip_par_exec_overlap     :: parallel_execution_overlap,
-                ip_par_exec_metrics     :: parallel_exec_metrics_incomplete,
-                ip_is_outermost_conj    :: is_outermost_conjunct,
-                ip_can_make_cheap_conj  :: can_make_cheap_conj
+                ip_par_exec_metrics     :: parallel_exec_metrics_incomplete
             ).
 
-:- type can_make_cheap_conj
-    --->    can_make_cheap_conj
-    ;       cannot_make_cheap_conj.
-
 :- pred find_best_parallelisation(implicit_parallelism_info::in, 
-    program_location::in, int::in, dependency_maps::in, 
-    cord(pard_goal_detail)::in, best_parallelisation::out) is det.
+    program_location::in, int::in, 
+    list(pard_goal_detail)::in, maybe(best_parallelisation)::out) is det.
+
+find_best_parallelisation(Info, Location, PartNum, Goals,
+        MaybeBestParallelisation) :-
+    preprocess_conjunction(Goals, PreprocessedGoals),
+    PreprocessedGoals = goals_for_parallelisation(GoalGroups, 
+        DependencyMaps, CostlyCallsIndexes, NumCalls),
+    ( last(CostlyCallsIndexes, LastCostlyCallIndexPrime) ->
+        LastCostlyCallIndex = LastCostlyCallIndexPrime
+    ;
+        error(this_file ++ "no costly calls found")
+    ),
+    branch_and_bound(
+        generate_parallelisations(Info, Location, PartNum, LastCostlyCallIndex, 
+            NumCalls, DependencyMaps, GoalGroups),
+        parallelisation_get_par_time,
+        Solutions, Profile),
 
-find_best_parallelisation(Info, Location, PartNum, DependencyMaps, Goals,
-        FinalBestParallelisation) :-
-    % Use a failure driven loop to implement a branch and bound solver.
-    promise_pure (
         trace [compile_time(flag("debug_branch_and_bound")), io(!IO)] (
-            io.format("D: Branch and bound loop starting:\n%s\n",
+        io.format("D: Find best parallelisation for:\n%s\n",
                 [s(LocationString)], !IO),
             location_to_string(1, Location, LocationCord),
-            string.append_list(list(LocationCord), LocationString)
+        string.append_list(list(LocationCord), LocationString),
+        io.format("D: Problem size: %d Solutions: %d\n",
+            [i(length(GoalGroups)), i(count(Solutions))], !IO),
+        io.format("D: Branch and bound profile: %s\n\n",
+            [s(string(Profile))], !IO)
         ),
-        impure new_mutvar(bp_no_best_parallelisation,
-            BestParallelisationMutvar),
-        (
-            impure find_best_parallelisation_nondet(Info, Location, PartNum, 
-                DependencyMaps, Goals, BestParallelisationMutvar, 
-                BestParallelisation),
-            % this is the best parallelisation so far.
-            impure set_mutvar(BestParallelisationMutvar, BestParallelisation),
-            trace [compile_time(flag("debug_branch_and_bound")), io(!IO)] (
+    
                 (
-                    BestParallelisation = bp_no_best_parallelisation
-                ;
-                    BestParallelisation = bp_parallel_execution(_, _, _, _,
-                        Metrics),
-                    io.format("D: Branch and bound: Solution par time: %f\n",
-                        [f(ParTime)], !IO),
-                    ParTime = parallel_exec_metrics_get_par_time(Metrics)
-                )
-            ),
-            semidet_fail
+        promise_equivalent_solutions [BestParallelisation]
+        member(BestParallelisation, Solutions) 
         ->
-            error("Failure driven loop must fail")
+        MaybeBestParallelisation = yes(BestParallelisation)
         ;
-            impure get_mutvar(BestParallelisationMutvar, 
-                FinalBestParallelisation),
-            trace [compile_time(flag("debug_branch_and_bound")), io(!IO)] (
-                io.write_string("D: Branch and bound loop finished\n", !IO)
-            )
-        )
+        MaybeBestParallelisation = no
     ).
 
-:- impure pred find_best_parallelisation_nondet(implicit_parallelism_info::in,
-    program_location::in, int::in, dependency_maps::in, 
-    cord(pard_goal_detail)::in, 
-    mutvar(best_parallelisation)::in, best_parallelisation::out) is nondet.
-
-find_best_parallelisation_nondet(Info, Location, PartNum, DependencyMaps, Goals0, 
-        BestParallelisationMutvar, BestParallelisation) :-
-    find_costly_call(Goals0, cord.empty, FindCostlyCallResult),
-    (
-        FindCostlyCallResult = costly_call_found(GoalsBeforeFirstCall, 
-            FirstCall, GoalsAfterFirstCall)
-    ;
-        FindCostlyCallResult = costly_call_not_found,
-        location_to_string(1, Location, LocationString),
-        Error = singleton(this_file) ++ singleton("\n") ++
-            LocationString ++
-            nl_indent(1) ++ singleton(format("partition %d", [i(PartNum)])) ++
-            nl_indent(1) ++ singleton("Couldn't find first call\n"),
-        error(append_list(cord.list(Error)))
-    ),
-    FirstCallCallSite = FirstCall ^ goal_annotation ^ pgd_pg_type 
-        ^ pgtc_call_site,
-    NumCalls = FirstCallCallSite ^ ccsr_call_site_summary ^
-        perf_row_calls,
-   
-    % Generate goals before conjunction.
-    cord_split(GoalsBeforeFirstCall, GoalsBeforeConj, 
-        GoalsBeforeCallInFirstConj),
-    Goals1 = GoalsBeforeCallInFirstConj ++ singleton(FirstCall) ++ 
-        GoalsAfterFirstCall,
-
-    foldl_pred(pardgoal_calc_cost, GoalsBeforeConj, 0.0, CostBeforeConj),
-    Metrics0 = init_empty_parallel_exec_metrics(CostBeforeConj),
-    Parallelisation0 = incomplete_parallelisation(GoalsBeforeConj, 
-        empty, peo_empty_conjunct, Metrics0, is_outermost_conjunct,
-        cannot_make_cheap_conj),
-    impure find_best_parallelisation_2(Info, DependencyMaps, 
-        BestParallelisationMutvar, 0, map.init, Goals1, 
-        GoalsAfterConj, Parallelisation0, Parallelisation1),
-
-    % Finalise the metrics and determine if this is the best solution so far.
-    foldl_pred(pardgoal_calc_cost, GoalsAfterConj, 0.0, CostAfterConj),
-    Metrics1 = Parallelisation1 ^ ip_par_exec_metrics,
-    SparkDelay = Info ^ ipi_opts ^ cpc_sparking_delay,
-    SparkCost = Info ^ ipi_opts ^ cpc_sparking_cost,
-    Metrics = finalise_parallel_exec_metrics(Metrics1, NumCalls, CostAfterConj, 
-        float(SparkDelay + SparkCost)),
-
-    impure get_mutvar(BestParallelisationMutvar, CurrentBestParallelisation),
-    cbmr_metrics_is_better = compare_best_parallelisation_and_metrics(
-        CurrentBestParallelisation, Metrics),
+:- type goal_group(T)
+    --->    gg_singleton(pard_goal_detail, T)
+    ;       gg_group(int, list(pard_goal_detail), T).
+
+:- inst gg_singleton
+    --->    gg_singleton(ground, ground).
+
+% NOTE: These commented out types are relevant for some work that hasn't been
+% done.  They will either be used or removed in a future change.
 
-    Conjuncts = Parallelisation1 ^ ip_par_conjs,
-    Overlap = Parallelisation1 ^ ip_par_exec_overlap,
+%:- type goal_placement
+%    --->    goal_placement(
+%                gp_can_split_group          :: can_split_group,
+%                gp_placement_choices        :: set(goal_placement_enum)
+%            ).
+
+    % The valid placement decisions that can be made for a given goal.
+    %
+    % Note that this type is not as expressive as it could be, this is
+    % deliberate to avoid symmetry.  It is however still complete.
+    %
+    %  + place_independent is just gpe_place_in_new_conj with the next goal
+    %    having gpe_place_in_new_conj.
+    %
+    %  + place_with_next is just gpe_place_in_new_conj with the next goal
+    %    having place_with_previous.
+    %
+%:- type goal_placement_enum
+%    --->    gpe_place_before_par_conj
+%    ;       gpe_place_after_par_conj
+%    ;       gpe_place_with_previous
+%    ;       gpe_place_in_new_conj.
+%
+%:- type can_split_group
+%    --->    can_split_group
+%    ;       cannot_split_group.
+
+:- type goal_classification
+    --->    gc_cheap_goals
+    ;       gc_costly_goals.
+
+:- type goals_for_parallelisation
+    --->    goals_for_parallelisation(
+                gfp_groups                  :: 
+                    list(goal_group(goal_classification)),
+
+                gfp_dependency_maps         :: dependency_maps,
+                gfp_costly_call_indexes     :: list(int),
+                gfp_num_calls               :: int
+            ).
+
+:- pred preprocess_conjunction(list(pard_goal_detail)::in,
+    goals_for_parallelisation::out) is det.
+
+preprocess_conjunction(Goals0, GoalsForParallelisation) :-
+    % Phase 1: Build a dependency map.
+    build_dependency_maps(Goals0, DependencyMaps),
+    % Phase 2: Find the costly calls.
+    identify_costly_calls(Goals0, 1, GoalGroups, CostlyCallsIndexes),
+   
+    % Get the number of calls into this conjunction.
+    (
+        CostlyCallsIndexes = [FirstCostlyCallIndex | _],
+        list.index1(Goals0, FirstCostlyCallIndex, FirstCostlyCall),
+        GoalType = FirstCostlyCall ^ goal_annotation ^ pgd_pg_type,
+        GoalType = pgt_call(_, _, _, CallSite)
+    ->
+        NumCalls = CallSite ^ ccsr_call_site_summary ^ perf_row_calls
+    ;
+        error(this_file ++ "Expected call goal")
+    ),
+
+    GoalsForParallelisation = goals_for_parallelisation(GoalGroups,
+        DependencyMaps, CostlyCallsIndexes, NumCalls).
+
+    % identify_costly_calls(Goals, 1, GoalGroups, SortedCostlyIndexes).
+    %
+    % GoalGroups are Goals divided into groups of single costly calls and
+    % multiple goals in-between these calls.  SortedCostlyIndexes are the
+    % indexes of the costly calls in the original list (starting at 1).  This
+    % predicate is undefined if any of the goals in Goals are non-atomic.
+    %
+:- pred identify_costly_calls(list(pard_goal_detail)::in, int::in,
+    list(goal_group(goal_classification))::out(list(gg_singleton)), 
+    list(int)::out) is det. 
+
+identify_costly_calls([], _, [], []).
+identify_costly_calls([Goal | Goals], Index, GoalGroups, Indexes) :-
+    identify_costly_calls(Goals, Index+1, GoalGroups0, Indexes0),
+    identify_costly_call(Goal, Costly),
+    (
+        Costly = is_costly_goal,
+        GoalClassification = gc_costly_goals,
+        Indexes = [Index | Indexes0]
+    ;
+        Costly = is_not_costly_goal,
+        GoalClassification = gc_cheap_goals,
+        Indexes = Indexes0
+    ;
+        Costly = is_non_atomic_goal,
+        error(this_file ++ "Unexpected pgt_non_atomic_goal")
+    ),
+    GoalGroup = gg_singleton(Goal, GoalClassification),
+    GoalGroups = [GoalGroup | GoalGroups0].
+
+:- func parallelisation_get_par_time(best_parallelisation) = float.
+
+parallelisation_get_par_time(Parallelisation) = ParTime :-
+    Metrics = Parallelisation ^ bp_par_exec_metrics,
+    ParTime = parallel_exec_metrics_get_par_time(Metrics).
+
+:- semipure pred generate_parallelisations(implicit_parallelism_info::in,
+    program_location::in, int::in, int::in, int::in, dependency_maps::in,
+    list(goal_group(goal_classification))::in, 
+    bnb_state(best_parallelisation)::in, best_parallelisation::out) is nondet.
+
+generate_parallelisations(Info, _Location, _PartNum, LastCostlyCallIndex,
+        NumCalls, DependencyMaps, !.GoalGroups, BNBState, 
+        BestParallelisation) :-
+    SparkCost = Info ^ ipi_opts ^ cpcp_sparking_cost,
+    SparkDelay = Info ^ ipi_opts ^ cpcp_sparking_delay,
+    ContextWakeupDelay = Info ^ ipi_opts ^ cpcp_context_wakeup_delay,
+
+    some [!GoalNum, !Parallelisation] (
+        !:GoalNum = 1,
+        
+        generate_parallelisations_goals_before(GoalsBeforeConj, !GoalNum,
+            !GoalGroups),
+        foldl(pardgoal_calc_cost, GoalsBeforeConj, 0.0, CostBeforeConj),
+        Metrics0 = init_empty_parallel_exec_metrics(CostBeforeConj, NumCalls, 
+            float(SparkCost), float(SparkDelay), float(ContextWakeupDelay)),
+        !:Parallelisation = incomplete_parallelisation(GoalsBeforeConj, 
+            [], peo_empty_conjunct, Metrics0),
+
+        semipure generate_parallelisations_body(Info, DependencyMaps, 0,
+            map.init, BNBState, LastCostlyCallIndex, !GoalNum, !GoalGroups,
+            !Parallelisation),
+
+        generate_parallelisations_goals_after(!.GoalNum, !.GoalGroups,
+            GoalsAfterConj),
+       
+        Parallelisation = !.Parallelisation
+    ),
+    
+    % Finalise the metrics and determine if this is the best solution so
+    % far.
+    foldl(pardgoal_calc_cost, GoalsAfterConj, 0.0, CostAfterConj),
+    Metrics1 = Parallelisation ^ ip_par_exec_metrics,
+    Metrics = finalise_parallel_exec_metrics(Metrics1, CostAfterConj),
+
+    Conjuncts = Parallelisation ^ ip_par_conjs,
+    Overlap = Parallelisation ^ ip_par_exec_overlap,
     par_conj_overlap_is_dependant(Overlap, IsDependent),
     BestParallelisation = bp_parallel_execution(GoalsBeforeConj, Conjuncts,
-        GoalsAfterConj, IsDependent, Metrics).
-
-:- impure pred find_best_parallelisation_2(implicit_parallelism_info::in,
-    dependency_maps::in, mutvar(best_parallelisation)::in, 
-    int::in, map(var_rep, float)::in, 
-    cord(pard_goal_detail)::in, cord(pard_goal_detail)::out, 
-    incomplete_parallelisation::in, incomplete_parallelisation::out) is nondet. 
+        GoalsAfterConj, IsDependent, Metrics),
+    semipure test_incomplete_solution(BNBState, BestParallelisation).
 
-find_best_parallelisation_2(Info, DependencyMaps, BestParallelisationMutvar, 
-        NumConjuncts0, ProductionsMap0, !Goals, !Parallelisation) :-
-    IsOutermostConjunct = !.Parallelisation ^ ip_is_outermost_conj,
-    find_costly_call(!.Goals, cord.empty, FindCostlyCallResult),
-    (
-        FindCostlyCallResult = 
-            costly_call_found(GoalsBefore0, Call, GoalsAfter0),
-        
-        (
-            can_make_cheap_conj = !.Parallelisation ^ ip_can_make_cheap_conj,
-            find_costly_call(GoalsAfter0, cord.empty, 
-                costly_call_found(_, _, _)),
-            cord_split(GoalsBefore0, Conj, GoalsBefore),
-            !:Goals = snoc(GoalsBefore, Call) ++ GoalsAfter0,
-            !Parallelisation ^ ip_can_make_cheap_conj := cannot_make_cheap_conj
-        ;
-            % Determine how many goals to include after the call in this
-            % parallel conjunct.
-            cord_split(GoalsAfter0, GoalsAfter, !:Goals),
-            % Don't include two costly calls in the same parallel conjunct.
-            find_costly_call(GoalsAfter, cord.empty, costly_call_not_found),
-        
-            % Build the new conjunct and test to see if this is no worse than
-            % our best parallelisation.
-            Conj = snoc(GoalsBefore0, Call) ++ GoalsAfter,
-
-            % If we've found the last costly call then there are no more
-            % conjuncts to make and therefore we cannot make a cheap conjunct,
-            % otherwise we can make a cheap conjunct.
-            find_costly_call(!.Goals, cord.empty, Result2),
-            (
-                Result2 = costly_call_not_found,
-                !Parallelisation ^ ip_can_make_cheap_conj :=
-                    cannot_make_cheap_conj
+:- pred generate_parallelisations_goals_before(list(pard_goal_detail)::out, 
+    int::in, int::out, 
+    list(goal_group(goal_classification))::in, 
+    list(goal_group(goal_classification))::out) is multi. 
+
+generate_parallelisations_goals_before([], !GoalNum, !GoalGroups).
+generate_parallelisations_goals_before(Goals, !GoalNum, !GoalGroups) :-
+    !.GoalGroups = [GoalGroup | !:GoalGroups],
+    (
+        GoalGroup = gg_singleton(Goal, Classification),
+        !:GoalNum = !.GoalNum + 1,
+        NewGoals = [Goal]
             ;
-                Result2 = costly_call_found(_, _, _),
-                !Parallelisation ^ ip_can_make_cheap_conj := can_make_cheap_conj
-            )
+        GoalGroup = gg_group(Num, NewGoals, Classification),
+        !:GoalNum = !.GoalNum + Num
         ),
-        not is_empty(Conj),
-        Conjs0 = !.Parallelisation ^ ip_par_conjs,
-        Conjs = snoc(Conjs0, Conj),
+    Classification = gc_cheap_goals,
+    generate_parallelisations_goals_before(Goals0, !GoalNum, !GoalGroups),
+    Goals = NewGoals ++ Goals0.
         
+:- pred generate_parallelisations_goals_after(int::in, 
+    list(goal_group(goal_classification))::in, list(pard_goal_detail)::out) 
+    is det.
+
+generate_parallelisations_goals_after(_, [], []).
+generate_parallelisations_goals_after(Num0, [GG | GGs], Goals) :-
+    (
+        GG = gg_singleton(Goal, _),
+        Num = Num0 + 1,
+        NewGoals = [Goal]
+    ;
+        GG = gg_group(NewNum, NewGoals, _),
+        Num = Num0 + NewNum
+    ),
+    generate_parallelisations_goals_after(Num, GGs, Goals0),
+    Goals = NewGoals ++ Goals0.
+
+:- semipure pred generate_parallelisations_body(implicit_parallelism_info::in,
+    dependency_maps::in, int::in, map(var_rep, float)::in,
+    bnb_state(best_parallelisation)::in, int::in, int::in, int::out,
+    list(goal_group(goal_classification))::in, 
+    list(goal_group(goal_classification))::out,
+    incomplete_parallelisation::in, incomplete_parallelisation::out) is nondet. 
+
+generate_parallelisations_body(Info, DependencyMaps, NumConjuncts0,
+        ProductionsMap0, BNBState, LastCostlyCallIndex, !GoalNum, !GoalGroups,
+        !Parallelisation) :-
+    (
+        !.GoalNum > LastCostlyCallIndex
+    ->
+        % if we have already visited all the costly calls then there are no
+        % further parallelisations to make.
+        % Verify that we've generated at least two parallel conjuncts,
+        NumConjuncts0 >= 1
+    ;
+        % We continue building more parallel conjuncts.
+        !.GoalGroups = [_ | _],
+        generate_parallel_conjunct(ParConjGoals, !GoalNum, !GoalGroups),
+        
+        % Don't build a single parallel conjunct containing all the costly
+        % calls.
+        ( NumConjuncts0 = 0 => LastCostlyCallIndex >= !.GoalNum ),
+        
+        Conjs0 = !.Parallelisation ^ ip_par_conjs,
+        Conjs = Conjs0 ++ [seq_conj(ParConjGoals)],
         Metrics0 = !.Parallelisation ^ ip_par_exec_metrics,
         Overlap0 = !.Parallelisation ^ ip_par_exec_overlap,
-        calculate_parallel_cost_step(Info, IsOutermostConjunct, NumConjuncts0,
-            Conj, ProductionsMap0, ProductionsMap, Metrics0, Metrics, 
+        ( LastCostlyCallIndex >= !.GoalNum ->
+            IsInnermostConjunct = is_not_innermost_conjunct
+        ;
+            IsInnermostConjunct = is_innermost_conjunct
+        ),
+        calculate_parallel_cost_step(Info, IsInnermostConjunct, NumConjuncts0,
+            ParConjGoals, ProductionsMap0, ProductionsMap, Metrics0, Metrics, 
             Overlap0, Overlap),
         (
             Overlap = peo_empty_conjunct,
@@ -1728,60 +1795,64 @@ find_best_parallelisation_2(Info, Depend
         ;
             Overlap = peo_conjunction(_, _, DepVars)
         ),
-        ParalleliseDepConjs = Info ^ ipi_opts ^ cpc_parallelise_dep_conjs,
+        
+        % Reject parallelisations that have dependant variables if we're not
+        % allowed to create such parallelisations.
+        ParalleliseDepConjs = Info ^ ipi_opts ^ cpcp_parallelise_dep_conjs,
         (
-            ParalleliseDepConjs = do_not_parallelise_dep_conjs
-        =>
+            ParalleliseDepConjs = do_not_parallelise_dep_conjs,
             set.empty(DepVars)
+        ;
+            ParalleliseDepConjs = parallelise_dep_conjs
         ),
-        SparkCost = Info ^ ipi_opts ^ cpc_sparking_cost,
-        SparkDelay = Info ^ ipi_opts ^ cpc_sparking_delay,
-        MetricsComplete = finalise_parallel_exec_metrics(Metrics, 1, 0.0, 
-            float(SparkDelay + SparkCost)),
-        impure get_mutvar(BestParallelisationMutvar, 
-            CurrentBestParallelisation),
-        cbmr_metrics_is_better = compare_best_parallelisation_and_metrics(
-            CurrentBestParallelisation, MetricsComplete), 
+        
+        MetricsComplete = finalise_parallel_exec_metrics(Metrics, 0.0),
+        par_conj_overlap_is_dependant(Overlap, IsDependent),
+        Conjuncts = !.Parallelisation ^ ip_par_conjs,
+        GoalsBeforeConj = !.Parallelisation ^ ip_goals_before,
+        IncompleteParallelisation = bp_parallel_execution(GoalsBeforeConj,
+            Conjuncts, [], IsDependent, MetricsComplete),
+        semipure test_incomplete_solution(BNBState, IncompleteParallelisation),
 
         NumConjuncts = NumConjuncts0 + 1,
         !Parallelisation ^ ip_par_exec_overlap := Overlap,
         !Parallelisation ^ ip_par_exec_metrics := Metrics,
         !Parallelisation ^ ip_par_conjs := Conjs,
-        !Parallelisation ^ ip_is_outermost_conj := is_not_outermost_conjunct,
 
-        impure find_best_parallelisation_2(Info, DependencyMaps, 
-            BestParallelisationMutvar, 
-            NumConjuncts, ProductionsMap, !Goals, !Parallelisation)
-    ;
-        FindCostlyCallResult = costly_call_not_found
+        semipure generate_parallelisations_body(Info, DependencyMaps, 
+            NumConjuncts, ProductionsMap, BNBState, LastCostlyCallIndex,
+            !GoalNum, !GoalGroups, !Parallelisation)
     ).
 
-:- type compare_best_and_metrics_result
-    --->    cbmr_metrics_is_better
-    ;       cbmr_metrics_is_not_better.
+:- pred generate_parallel_conjunct(
+    list(pard_goal_detail)::out(non_empty_list), int::in, int::out, 
+    list(goal_group(goal_classification))::in(non_empty_list),
+    list(goal_group(goal_classification))::out) is multi.
 
-    % Compare the best parallelisation with the current parallelisation.
-    %
-:- func compare_best_parallelisation_and_metrics(best_parallelisation,
-        parallel_exec_metrics) = compare_best_and_metrics_result.
-
-compare_best_parallelisation_and_metrics(BestParallelisation, Metrics) = 
-        Result :-
+generate_parallel_conjunct(Goals, !GoalNum, !GoalGroups) :-
+    !.GoalGroups = [GoalGroup | !:GoalGroups],
+    (
+        GoalGroup = gg_singleton(Goal, _),
+        NewGoals = [Goal],
+        NumNewGoals = 1
+    ;
+        GoalGroup = gg_group(NumNewGoals, NewGoals, _)
+    ),
+    !:GoalNum = !.GoalNum + NumNewGoals,
     (
-        BestParallelisation = bp_no_best_parallelisation,
-        Result = cbmr_metrics_is_better
+        !.GoalGroups = [],
+        Goals0 = []
     ;
-        BestParallelisation = bp_parallel_execution(_, _, _, _, BestMetrics),
-        BestParTime = parallel_exec_metrics_get_par_time(BestMetrics),
-        ParTime = parallel_exec_metrics_get_par_time(Metrics),
-        % TODO: Add other comparisons for tie breaking or a better optimisation
-        % formula.
-        ( BestParTime > ParTime ->
-            Result = cbmr_metrics_is_better
+        !.GoalGroups = [_ | _],
+        % With these disjuncts in this order the predicate will return larger
+        % conjuncts first.
+        (
+            generate_parallel_conjunct(Goals0, !GoalNum, !GoalGroups)
         ;
-            Result = cbmr_metrics_is_not_better
+            Goals0 = []
         )
-    ).
+    ),
+    Goals = NewGoals ++ Goals0.
 
     % This datastructure represents the execution of dependant conjuncts, it
     % tracks which variabes are produced and consumed.
@@ -1813,9 +1884,9 @@ compare_best_parallelisation_and_metrics
                     % references for those variables that will become futures.
             ).
 
-:- type is_outermost_conjunct
-    --->    is_outermost_conjunct
-    ;       is_not_outermost_conjunct.
+:- type is_innermost_conjunct
+    --->    is_innermost_conjunct
+    ;       is_not_innermost_conjunct.
 
     % calculate_parallel_cost(Info, Conjunctions, IsOutermostConjunct,
     %   ParallelExecMetrics, ParallelExecOverlap, ProductionsMap, NumConjuncts).
@@ -1837,14 +1908,13 @@ compare_best_parallelisation_and_metrics
     %   CallerVars.
     %
 :- pred calculate_parallel_cost_step(implicit_parallelism_info::in,
-    is_outermost_conjunct::in, int::in, cord(pard_goal_detail)::in, 
+    is_innermost_conjunct::in, int::in, list(pard_goal_detail)::in, 
     map(var_rep, float)::in, map(var_rep, float)::out,
     parallel_exec_metrics_incomplete::in, parallel_exec_metrics_incomplete::out,
     parallel_execution_overlap::in, parallel_execution_overlap::out) is det.
 
-calculate_parallel_cost_step(Info, IsOutermostConjunct, NumConjuncts, Conj,
+calculate_parallel_cost_step(Info, IsInnermostConjunct, NumConjuncts, Goals,
         !ProductionsMap, !Metrics, !Overlap) :-
-    Goals = list(Conj),
     foldl(pardgoal_calc_cost, Goals, 0.0, CostB),
     foldl(pardgoal_consumed_vars_accum, Goals, set.init,
         RightConsumedVars),
@@ -1855,18 +1925,20 @@ calculate_parallel_cost_step(Info, IsOut
     % This conjunct will actually start after it has been sparked by
     % the prevous conjunct.  Which in turn may have been sparked by an
     % earlier conjunct.
-    SparkDelay = Info ^ ipi_opts ^ cpc_sparking_delay, 
-    StartTime0 = (NumConjuncts * SparkDelay),
+    SparkDelay = Info ^ ipi_opts ^ cpcp_sparking_delay, 
+    StartTime0 = float(NumConjuncts * SparkDelay),
 
     % If there are conjuncts after this conjunct we will have the
     % additional cost of sparking them.
     (
-        IsOutermostConjunct = is_not_outermost_conjunct,
-        SparkCost = Info ^ ipi_opts ^ cpc_sparking_cost,
-        StartTime = float(StartTime0 + SparkCost)
+        IsInnermostConjunct = is_not_innermost_conjunct,
+        % The cost of sparking a computation, (that is calling fork) is charged
+        % to the leftmost conjunct.
+        SparkCost = Info ^ ipi_opts ^ cpcp_sparking_cost,
+        StartTime = StartTime0 + float(SparkCost)
     ;
-        IsOutermostConjunct = is_outermost_conjunct,
-        StartTime = float(StartTime0)
+        IsInnermostConjunct = is_innermost_conjunct,
+        StartTime = StartTime0
     ),
 
     % Get the list of variables consumed by this conjunct that will be
@@ -1880,12 +1952,17 @@ calculate_parallel_cost_step(Info, IsOut
         ConsumptionsList, 0.0, LastSeqConsumeTime, 
         StartTime, LastParConsumeTime, StartTime, LastResumeTime, 
         [], RevExecution0, map.init, ConsumptionsMap),
-    RevExecution = [ (LastResumeTime - CostBPar) | RevExecution0 ],
-    reverse(RevExecution, Execution),
 
+    % Calculate the point at which this conjunct finishes execution and
+    % complete the RevExecutions structure..
+    reverse(RevExecution, Execution),
     CostBPar = LastParConsumeTime + (CostB - LastSeqConsumeTime),
-    !:Metrics = 
-        init_parallel_exec_metrics_incomplete(!.Metrics, CostB, CostBPar),
+    RevExecution = [ (LastResumeTime - CostBPar) | RevExecution0 ],
+    
+    CostSignals = 
+        float(Info ^ ipi_opts ^ cpcp_future_signal_cost * count(Vars)),
+    !:Metrics = init_parallel_exec_metrics_incomplete(!.Metrics, CostSignals,
+        CostB, CostBPar),
     
     % Build the productions map for our parent.  This map contains all
     % the variables produced by this code, not just that are used for
@@ -1897,8 +1974,39 @@ calculate_parallel_cost_step(Info, IsOut
         !.ProductionsMap, ConsumptionsMap),
     !:Overlap = peo_conjunction(!.Overlap, DepConjExec, Vars).
 
+    % calculate_dependant_parallel_cost_2(Info, ProductionsMap, 
+    %   Var - SeqConsTime, !PrevSeqConsumeTime, !PrevParConsumeTime,
+    %   !ResumeTime, !RevExecution, !ConsumptionsMap). 
+    %
     % The main loop of the parallel overlap analysis.
     %
+    % * ProductionsMap: A map of variable productions to the left of this
+    %   conjunct.
+    %
+    % * Var: The current variable under consideration.
+    %
+    % * SeqConsTime: The consumption time of the Var during sequential
+    %   execution.
+    %
+    % * !PrevSeqConsumeTime: Accumulates the time of the previous consumption
+    %   during sequential execution, or if there is none it represents the
+    %   beginning of sequential execution (0.0).
+    %
+    % * !PrevParConsumeTime: Accumulates the time of the previous consumption
+    %   during parallel execution.  Or if there is none this represents the
+    %   tame that the parallel conjunct first begun execution.
+    %
+    % * !ResumeTime: Accumulates the time that execution last resumed if it
+    %   became blocked, or the beginning of the parallel conjunct's execution.
+    %
+    % * !RevExecution: Accumulates a list of pairs, each pair stores the time
+    %   that execution begun and the time that it pasted.  This never includes
+    %   the remaining execution after all variables have been consumed.  This
+    %   is used by our caller to calculate the production times of this
+    %   conjunct for later ones.
+    %
+    % * !ConsumptionsMap: Accumuates a map of variable consumptions.
+    %
 :- pred calculate_dependant_parallel_cost_2(implicit_parallelism_info::in, 
     map(var_rep, float)::in, pair(var_rep, float)::in, float::in, float::out,
     float::in, float::out, float::in, float::out,
@@ -1908,7 +2016,6 @@ calculate_parallel_cost_step(Info, IsOut
 calculate_dependant_parallel_cost_2(Info, ProductionsMap, Var - SeqConsTime,
         !PrevSeqConsumeTime, !PrevParConsumeTime, !ResumeTime,
         !RevExecution, !ConsumptionsMap) :-
-    FutureSyncTime = float(Info ^ ipi_opts ^ cpc_locking_cost),
     map.lookup(ProductionsMap, Var, ProdTime),
 
     % Consider (P & Q):
@@ -1922,10 +2029,11 @@ calculate_dependant_parallel_cost_2(Info
     % waiting, and SeqConsTime - !.PrevSeqConsumeTime is how long Q will take
     % between the two waits.
     %
-    ParConsTimeBlocked = ProdTime + FutureSyncTime,
+    ParConsTimeBlocked = ProdTime,
     ParConsTimeNotBlocked = !.PrevParConsumeTime + 
         (SeqConsTime - !.PrevSeqConsumeTime),
-    ParConsTime = max(ParConsTimeBlocked, ParConsTimeNotBlocked),
+    ParConsTime0 = max(ParConsTimeBlocked, ParConsTimeNotBlocked) + 
+        float(Info ^ ipi_opts ^ cpcp_future_wait_cost),
         
     ( 
         % True if Q had to suspend waiting for P,  Not that we don't include
@@ -1933,11 +2041,14 @@ calculate_dependant_parallel_cost_2(Info
         % it can be made runable before the context switch is complete.
         ProdTime > ParConsTimeNotBlocked 
     ->
+        % Include the time that it may take to resume this thread.
+        ParConsTime = ParConsTime0 +
+            float(Info ^ ipi_opts ^ cpcp_context_wakeup_delay),
         !:RevExecution = 
             [ (!.ResumeTime - ParConsTimeNotBlocked) | !.RevExecution ],
         !:ResumeTime = ParConsTime
     ;
-        true
+        ParConsTime = ParConsTime0
     ),
 
     !:PrevSeqConsumeTime = SeqConsTime,
@@ -1984,6 +2095,7 @@ pardgoal_calc_cost(Goal, !Cost) :-
 
 :- type dependency_maps
     ---> dependency_maps(
+            % XXX: I'm pretty certain that these are transitive closures.
             dm_depends_on           :: map(int, set(int)),
                 % This map maps from conjunct numbers to the conjunct numbers
                 % of goals that they depend upon.
@@ -2317,7 +2429,7 @@ can_parallelise_call(Info, Detism, CallS
         % This is conditional so that we can gauretee that it never causes a
         % divide by zero error,
         PercallCost = cs_cost_get_percall(CallSiteCost),
-        PercallCost > float(Info ^ ipi_opts ^ cpc_call_site_threshold)
+        PercallCost > float(Info ^ ipi_opts ^ cpcp_call_site_threshold)
     ;
         fail 
     ).
@@ -2390,6 +2502,32 @@ maybe_costly_call(Info, GoalPath, Atomic
             CallSite) 
     ).
 
+:- type is_costly_goal
+    --->    is_costly_goal
+    ;       is_not_costly_goal
+    ;       is_non_atomic_goal.
+
+:- pred identify_costly_call(pard_goal_detail::in, is_costly_goal::out) is det.
+
+identify_costly_call(Goal, Costly) :-
+    GoalType = Goal ^ goal_annotation ^ pgd_pg_type,
+    (
+        GoalType = pgt_call(_, CostAboveThreshold, _, _),
+        (
+            CostAboveThreshold = cost_above_par_threshold,
+            Costly = is_costly_goal 
+        ;
+            CostAboveThreshold = cost_not_above_par_threshold,
+            Costly = is_not_costly_goal
+        )
+    ;
+        GoalType = pgt_other_atomic_goal,
+        Costly = is_not_costly_goal
+    ;
+        GoalType = pgt_non_atomic_goal,
+        Costly = is_non_atomic_goal
+    ).
+
 :- pred var_get_mode(inst_map::in, inst_map::in, var_rep::in, var_mode_rep::out)
     is det.
 
@@ -3069,27 +3207,5 @@ format_callee(named_callee(Module, Proc)
     format("%s.%s", [s(Module), s(Proc)])).
 
 %-----------------------------------------------------------------------------%
-
-:- pred cord_split(cord(T)::in, cord(T)::out, cord(T)::out) is multi.
-
-cord_split(Cord, A, B) :-
-    ( cord.head_tail(Cord, Head, Tail) ->
-        (
-            % Put the split before Cord,
-            A = cord.empty,
-            B = Cord
-        ;
-            % Put the split within or after Cord.
-            cord_split(Tail, TailA, B),
-            A = cons(Head, TailA)
-        )
-    ;
-        % An empty cord can only be split one way (since we say that the split
-        % may happen before the cord).
-        A = cord.empty,
-        B = cord.empty
-    ).
-
-%-----------------------------------------------------------------------------%
 :- end_module mdprof_fb.automatic_parallelism.
 %-----------------------------------------------------------------------------%
Index: deep_profiler/mdprof_feedback.m
===================================================================
RCS file: /home/mercury1/repository/mercury/deep_profiler/mdprof_feedback.m,v
retrieving revision 1.24
diff -u -p -b -r1.24 mdprof_feedback.m
--- deep_profiler/mdprof_feedback.m	4 Jul 2010 10:24:09 -0000	1.24
+++ deep_profiler/mdprof_feedback.m	14 Jul 2010 00:29:54 -0000
@@ -181,19 +181,35 @@ create_feedback_report(feedback_data_cal
         Report) :-
    Report = "  feedback_data_calls_above_threshold_sorted is deprecated\n".
 create_feedback_report(feedback_data_candidate_parallel_conjunctions(
-            DesiredParallelism, SparkingCost, SparkingDelay, LockingCost,
-            Conjs),
-        Report) :-
+        Parameters, Conjs), Report) :-
     NumConjs = length(Conjs),
+    Parameters = candidate_par_conjunctions_params(DesiredParallelism,
+        SparkingCost, SparkingDelay, SignalCost, WaitCost, 
+        ContextWakeupDelay, CliqueThreshold, CallSiteThreshold,
+        ParalleliseDepConjs),
     ReportHeader = singleton(format("  Candidate Parallel Conjunctions:\n" ++
             "    Desired parallelism: %f\n" ++
             "    Sparking cost: %d\n" ++
             "    Sparking delay: %d\n" ++
-            "    Locking cost: %d\n" ++
+            "    Future signal cost: %d\n" ++
+            "    Future wait cost: %d\n" ++
+            "    Context wakeup delay: %d\n" ++
+            "    Clique threshold: %d\n" ++
+            "    Call site threshold: %d\n" ++
+            "    Parallelise dependant conjunctions: %s\n" ++
             "    Number of Parallel Conjunctions: %d\n" ++
             "    Parallel Conjunctions:\n\n",
         [f(DesiredParallelism), i(SparkingCost), i(SparkingDelay),
-         i(LockingCost), i(NumConjs)])),
+         i(SignalCost), i(WaitCost), i(ContextWakeupDelay), 
+         i(CliqueThreshold), i(CallSiteThreshold), s(ParalleliseDepConjsStr),
+         i(NumConjs)])),
+    (
+        ParalleliseDepConjs = parallelise_dep_conjs,
+        ParalleliseDepConjsStr = "yes"
+    ;
+        ParalleliseDepConjs = do_not_parallelise_dep_conjs,
+        ParalleliseDepConjsStr = "no"
+    ),
     map(create_candidate_parallel_conj_proc_report, Conjs, ReportConjs),
     Report = append_list(list(ReportHeader ++ cord_list_to_cord(ReportConjs))).
 
@@ -244,10 +260,17 @@ help_message =
                 The time taken from the time a spark is created until the spark
                 is executed by another processor assuming that there is a free
                 processor.
-    --implicit-parallelism-locking-cost <value>
-                The cost of maintaining a lock for a single dependant variable
-                in a conjunction, measured in the profiler's call sequence
-                counts.
+    --implicit-parallelism-future-signal-cost <value>
+                The cost of the signal() call for the producer of a shared
+                variable, measured in the profiler's call sequence counts.
+    --implicit-parallelism-future-wait-cost <value>
+                The cost of the wait() call for the consumer of a shared
+                variable, measured in the profiler's call sequence counts.
+    --implicit-parallelism-context-wakeup-delay <value>
+                The time taken for a context to resume execution after being
+                placed on the run queue.  This is used to estimate the impact
+                of blocking of a context's execution, it is measured in the
+                profiler's call sequence counts.
     --implicit-parallelism-clique-cost-threshold <value>
                 The cost threshold for cliques to be considered for implicit
                 parallelism, measured on the profiler's call sequence counts.
@@ -277,13 +300,14 @@ help_message =
                 Produce a list of candidate parallel conjunctions for implicit
                 parallelism.  This option uses the implicit parallelism
                 settings above.
-    ".
+
+".
 
 :- pred write_help_message(string::in, io::di, io::uo) is det.
 
 write_help_message(ProgName, !IO) :-
     Message = help_message,
-    io.format(Message, [s(ProgName)], !IO).
+    io.format(Message, duplicate(4, s(ProgName)), !IO).
 
 :- pred write_version_message(string::in, io::di, io::uo) is det.
 
@@ -345,7 +369,9 @@ read_deep_file(Input, Debug, MaybeDeep, 
     ;       desired_parallelism
     ;       implicit_parallelism_sparking_cost
     ;       implicit_parallelism_sparking_delay
-    ;       implicit_parallelism_locking_cost
+    ;       implicit_parallelism_future_signal_cost
+    ;       implicit_parallelism_future_wait_cost
+    ;       implicit_parallelism_context_wakeup_delay
     ;       implicit_parallelism_clique_cost_threshold
     ;       implicit_parallelism_call_site_cost_threshold
     ;       implicit_parallelism_dependant_conjunctions.
@@ -380,7 +406,12 @@ long("implicit-parallelism",            
 long("desired-parallelism",                 desired_parallelism).
 long("implicit-parallelism-sparking-cost",  implicit_parallelism_sparking_cost).
 long("implicit-parallelism-sparking-delay", implicit_parallelism_sparking_delay).
-long("implicit-parallelism-locking-cost",   implicit_parallelism_locking_cost).
+long("implicit-parallelism-future-signal-cost",
+    implicit_parallelism_future_signal_cost).
+long("implicit-parallelism-future-wait-cost",
+    implicit_parallelism_future_wait_cost).
+long("implicit-parallelism-context-wakeup-delay",
+    implicit_parallelism_context_wakeup_delay).
 long("implicit-parallelism-clique-cost-threshold", 
     implicit_parallelism_clique_cost_threshold).
 long("implicit-parallelism-call-site-cost-threshold",
@@ -407,7 +438,9 @@ defaults(desired_parallelism,           
 % be tested for.
 defaults(implicit_parallelism_sparking_cost,                int(100)).
 defaults(implicit_parallelism_sparking_delay,               int(1000)).
-defaults(implicit_parallelism_locking_cost,                 int(100)).
+defaults(implicit_parallelism_future_signal_cost,           int(100)).
+defaults(implicit_parallelism_future_wait_cost,             int(250)).
+defaults(implicit_parallelism_context_wakeup_delay,         int(1000)).
 defaults(implicit_parallelism_clique_cost_threshold,        int(100000)).
 defaults(implicit_parallelism_call_site_cost_threshold,     int(50000)).
 defaults(implicit_parallelism_dependant_conjunctions,       bool(no)).
@@ -426,7 +459,7 @@ construct_measure("median",     stat_med
                 maybe_calls_above_threshold_sorted
                     :: maybe(calls_above_threshold_sorted_opts),
                 maybe_candidate_parallel_conjunctions
-                    :: maybe(candidate_parallel_conjunctions_opts)
+                    :: maybe(candidate_par_conjunctions_params)
             ).
 
 :- pred check_verbosity_option(option_table(option)::in, int::out) is semidet.
@@ -504,10 +537,14 @@ check_options(Options0, RequestedFeedbac
             SparkingCost),
         lookup_int_option(Options, implicit_parallelism_sparking_delay,
             SparkingDelay),
-        lookup_int_option(Options, implicit_parallelism_locking_cost,
-            LockingCost),
+        lookup_int_option(Options, implicit_parallelism_future_signal_cost,
+            FutureSignalCost),
+        lookup_int_option(Options, implicit_parallelism_future_wait_cost,
+            FutureWaitCost),
+        lookup_int_option(Options, implicit_parallelism_context_wakeup_delay,
+            ContextWakeupDelay),
         lookup_int_option(Options, implicit_parallelism_clique_cost_threshold,
-            CPCProcThreshold),
+            CPCCliqueThreshold),
         lookup_int_option(Options, 
             implicit_parallelism_call_site_cost_threshold,
             CPCCallSiteThreshold),
@@ -522,11 +559,13 @@ check_options(Options0, RequestedFeedbac
             ParalleliseDepConjs = do_not_parallelise_dep_conjs
         ),
         CandidateParallelConjunctionsOpts =
-            candidate_parallel_conjunctions_opts(DesiredParallelism, 
+            candidate_par_conjunctions_params(DesiredParallelism, 
                 SparkingCost,
                 SparkingDelay,
-                LockingCost,
-                CPCProcThreshold,
+                FutureSignalCost,
+                FutureWaitCost,
+                ContextWakeupDelay,
+                CPCCliqueThreshold,
                 CPCCallSiteThreshold,
                 ParalleliseDepConjs),
         MaybeCandidateParallelConjunctionsOpts =
Index: mdbcomp/feedback.m
===================================================================
RCS file: /home/mercury1/repository/mercury/mdbcomp/feedback.m,v
retrieving revision 1.12
diff -u -p -b -r1.12 feedback.m
--- mdbcomp/feedback.m	4 Jul 2010 10:24:09 -0000	1.12
+++ mdbcomp/feedback.m	14 Jul 2010 00:29:54 -0000
@@ -66,19 +66,7 @@
                     % Data of this type represents a list of candidate
                     % conjunctions for implicit parallelism.
 
-                desired_parallelism :: float,
-                    % The number of desired busy sparks.
-
-                sparking_cost       :: int,
-                    % The cost of creating a spark in call sequence counts.
-
-                sparking_delay      :: int,
-                    % The time taken between the creation of the spark and when
-                    % it starts being executed in call sequence counts.
-
-                locking_cost        :: int,
-                    % The cost of maintaining a lock on a single dependant
-                    % variable in call sequence counts.
+                parameters      :: candidate_par_conjunctions_params,
 
                 conjunctions        :: assoc_list(string_proc_label, 
                                         candidate_par_conjunctions_proc)
@@ -88,13 +76,64 @@
 
 :- inst feedback_data_query
     --->    feedback_data_calls_above_threshold_sorted(free, free, free)
-    ;       feedback_data_candidate_parallel_conjunctions(free, free, free,
-                free, free).
+    ;       feedback_data_candidate_parallel_conjunctions(free, free).
 
 :- type stat_measure
     --->    stat_mean
     ;       stat_median.
 
+:- type candidate_par_conjunctions_params
+    --->    candidate_par_conjunctions_params(
+                cpcp_desired_parallelism    :: float,
+                    % The number of desired busy sparks.
+
+                cpcp_sparking_cost          :: int,
+                    % The cost of creating a spark and adding it to the local
+                    % work queue in call sequence counts.
+
+                cpcp_sparking_delay         :: int,
+                    % The time taken between the creation of the spark and when
+                    % it starts being executed in call sequence counts.
+
+                cpcp_future_signal_cost     :: int,
+                cpcp_future_wait_cost       :: int,
+                    % The costs of maintaining a lock on a single dependant
+                    % variable in call sequence counts.  The first gives the
+                    % cost of the call to signal and the second gives the cost
+                    % of the call to wait assuming that the value is already
+                    % available.
+
+                cpcp_context_wakeup_delay   :: int,
+                    % The time it takes for a context to resume execution once
+                    % it has been put on the runnable queue assuming that an
+                    % engine is available to pick it up.  This is measured in
+                    % call sequence counts.
+                    %
+                    % This is used to calculate how soon a context can recover
+                    % after being blocked by a future.  It is also used to
+                    % determine how quickly the context executing
+                    % MR_join_and_continue after completing the leftmost
+                    % conjunct of a parallel conjunction can recover after
+                    % being blocked on the completion of one of the other
+                    % conjuncts.
+
+                cpcp_clique_threshold       :: int,
+                    % The cost threshold in call sequence counts of a clique
+                    % before it is considered for parallel execution.
+
+                cpcp_call_site_threshold    :: int,
+                    % The cost threshold in call sequence counts of a call site
+                    % before it is considered for parallel execution.
+
+                cpcp_parallelise_dep_conjs  :: parallelise_dep_conjs
+                    % Whether we will allow parallelisation to result in
+                    % dependant parallel conjunctions.
+            ).
+
+:- type parallelise_dep_conjs
+    --->    parallelise_dep_conjs
+    ;       do_not_parallelise_dep_conjs.
+
     % The set of candidate parallel conjunctions within a procedure.
     %
 :- type candidate_par_conjunctions_proc(GoalType)
@@ -226,7 +265,7 @@
 :- type parallel_exec_metrics_incomplete.
     
     % ParMetrics = init_parallel_exec_metrics_incomplete(PartMetricsA,
-    %   CostBSeq, CostBPar) 
+    %   TimeSignal, TimeBSeq, TimeBPar) 
     %
     % Use this function to build parallel execution metrics for a parallel
     % conjunction of any size greater than one.
@@ -240,19 +279,20 @@
     % variables in A.
     %
 :- func init_parallel_exec_metrics_incomplete(parallel_exec_metrics_incomplete,
-    float, float) = parallel_exec_metrics_incomplete.
+    float, float, float) = parallel_exec_metrics_incomplete.
 
-    % StartMetrics = init_empty_parallel_exec_metrics(CostBefore).
+    % StartMetrics = init_empty_parallel_exec_metrics(CostBefore, NumCalls,
+    %   SparkCost, SparkDelay, ContextWakeupDelay).
     %
     % Use this function to start with an empty set of metrics for an empty
     % conjunction.  Then use init_parallel_exec_metrics_incomplete to continue
     % adding conjuncts on the right.
     %
-:- func init_empty_parallel_exec_metrics(float) = 
+:- func init_empty_parallel_exec_metrics(float, int, float, float, float) = 
     parallel_exec_metrics_incomplete.
 
-    % Metrics = finalise_parallel_exec_metrics(IncompleteMetrics, NumCalls,
-    %   CostAfterConj, RightConjDelay).:w
+    % Metrics = finalise_parallel_exec_metrics(IncompleteMetrics,
+    %   CostAfterConj).
     %
     % Make the metrics structure complete.
     %
@@ -260,8 +300,8 @@
     % begin executing.  & is considered to be right-associative since that's
     % how sparks are sparked.
     %
-:- func finalise_parallel_exec_metrics(parallel_exec_metrics_incomplete, 
-    int, float, float) = parallel_exec_metrics.
+:- func finalise_parallel_exec_metrics(parallel_exec_metrics_incomplete, float)
+    = parallel_exec_metrics.
 
 :- func parallel_exec_metrics_get_num_calls(parallel_exec_metrics) = int.
 
@@ -410,9 +450,11 @@
 
 :- implementation.
 
+:- import_module bool.
 :- import_module exception.
 :- import_module float.
 :- import_module map.
+:- import_module maybe.
 :- import_module require.
 :- import_module svmap.
 :- import_module unit.
@@ -478,7 +520,7 @@ put_feedback_data(Data, !Info) :-
 feedback_data_type(feedback_type_calls_above_threshold_sorted,
     feedback_data_calls_above_threshold_sorted(_, _, _)).
 feedback_data_type(feedback_type_candidate_parallel_conjunctions,
-    feedback_data_candidate_parallel_conjunctions(_, _, _, _, _)).
+    feedback_data_candidate_parallel_conjunctions(_, _)).
 
 :- pred feedback_data_mismatch_error(string::in, feedback_type::in, 
     feedback_data::in) is erroneous.
@@ -753,7 +795,7 @@ feedback_first_line = "Mercury Compiler 
 
 :- func feedback_version = string.
 
-feedback_version = "9".
+feedback_version = "10".
 
 %-----------------------------------------------------------------------------%
 %
@@ -783,23 +825,55 @@ convert_seq_conj(Conv, seq_conj(Conjs0),
 
 :- type parallel_exec_metrics
     --->    parallel_exec_metrics(
-                pem_inner_metrics       :: parallel_exec_metrics_incomplete,
+                pem_inner_metrics           :: parallel_exec_metrics_internal,
                 pem_num_calls           :: int,
                 pem_time_before_conj    :: float,
                 pem_time_after_conj     :: float,
-                pem_right_conj_delay    :: float
+                pem_left_conj_cost          :: float,
+                    % The cost of calling fork() in the conjunct to the left of
+                    % a & symbol.
+                pem_right_conj_delay        :: float,
                     % The delay before a conjunct to the right of & begins
                     % executing.
+
+                pem_context_wakeup_delay    :: float
             ).
 
 :- type parallel_exec_metrics_incomplete
-    --->    pem_initial(
-                pemi_time_before_conj    :: float
+    --->    pem_incomplete(
+                pemi_time_before_conj       :: float,
+
+                pemi_num_calls              :: int,
+
+                pemi_spark_cost             :: float,
+
+                pemi_spark_delay            :: float,
+
+                pemi_context_wakeup_delay   :: float,
+
+                pemi_internal               ::
+                        maybe(parallel_exec_metrics_internal)
+                    % If there are no internal conjuncts then the parallel
+                    % conjunction is empty.
+            ).
+
+:- type parallel_exec_metrics_internal
+    --->    pem_left_most(
+                pemi_time_seq               :: float,
+                pemi_time_par               :: float
             )
     ;       pem_additional(
-                pemi_time_left           :: parallel_exec_metrics_incomplete,
+                pemi_time_left              :: parallel_exec_metrics_internal,
                     % The time of the left conjunct (that may be a conjunction),
 
+                pemi_time_left_signals      :: float,
+                    % The additional cost of calling signal within the left
+                    % conjunct.
+                    % NOTE: Note that this should be added to each of the
+                    % individual conjuncts _where_ they call signal but thta is
+                    % more difficult and may not be required.  We may visit it
+                    % in the future.
+
                 pemi_time_right_seq      :: float,
                     % The time of the right conjunct if it is running after
                     % the left in normal sequential execution.
@@ -812,39 +886,55 @@ convert_seq_conj(Conv, seq_conj(Conjs0),
                     % parallel execution overheads and delays.
             ).
 
-init_parallel_exec_metrics_incomplete(MetricsA, TimeBSeq, TimeBPar) = 
-    pem_additional(MetricsA, TimeBSeq, TimeBPar).
+init_parallel_exec_metrics_incomplete(Metrics0, TimeSignals, TimeBSeq, 
+        TimeBPar) = Metrics :-
+    MaybeInternal0 = Metrics0 ^ pemi_internal,
+    (
+        MaybeInternal0 = yes(Internal0),
+        Internal = pem_additional(Internal0, TimeSignals, TimeBSeq, TimeBPar)
+    ;
+        MaybeInternal0 = no,
+        Internal = pem_left_most(TimeBSeq, TimeBPar),
+        require(unify(TimeSignals, 0.0),
+            this_file ++ "TimeSignal != 0")
+    ),
+    Metrics = Metrics0 ^ pemi_internal := yes(Internal).
 
-init_empty_parallel_exec_metrics(TimeBefore) = pem_initial(TimeBefore).
+init_empty_parallel_exec_metrics(TimeBefore, NumCalls, SparkCost, 
+        SparkDelay, ContextWakeupDelay) = 
+    pem_incomplete(TimeBefore, NumCalls, SparkCost, SparkDelay,
+        ContextWakeupDelay, no).
 
-finalise_parallel_exec_metrics(IncompleteMetrics, NumCalls, TimeAfter,
-        RightConjDelay) 
-    =
-        parallel_exec_metrics(IncompleteMetrics, NumCalls, TimeBefore,
-        TimeAfter, RightConjDelay) :-
-    TimeBefore = 
-        parallel_exec_metrics_incomp_get_time_before(IncompleteMetrics).
-
-:- func parallel_exec_metrics_incomp_get_time_before(
-    parallel_exec_metrics_incomplete) = float.
-
-parallel_exec_metrics_incomp_get_time_before(pem_initial(Time)) = Time.
-parallel_exec_metrics_incomp_get_time_before(
-        pem_additional(Left, _, _)) = Time :-
-    Time = parallel_exec_metrics_incomp_get_time_before(Left).
+finalise_parallel_exec_metrics(IncompleteMetrics, TimeAfter) = Metrics :-
+    IncompleteMetrics = pem_incomplete(TimeBefore, NumCalls, SparkCost,
+        SparkDelay, ContextWakeupDelay, MaybeInternal),
+    (
+        MaybeInternal = yes(Internal)
+    ;
+        MaybeInternal = no,
+        error(this_file ++ "Cannot finalise empty parallel metrics.")
+    ),
+    Metrics = parallel_exec_metrics(Internal, NumCalls, TimeBefore, TimeAfter,
+        SparkCost, SparkDelay, ContextWakeupDelay).
 
 parallel_exec_metrics_get_num_calls(PEM) = NumCalls :-
     NumCalls = PEM ^ pem_num_calls.
 
 parallel_exec_metrics_get_par_time(PEM) = Time :-
     Inner = PEM ^ pem_inner_metrics,
-    InnerTime = parallel_exec_metrics_incomp_get_par_time(Inner),
+    InnerTime = parallel_exec_metrics_internal_get_par_time(Inner),
+    FirstConjTime = pem_get_first_conj_time(Inner),
     BeforeAndAfterTime = PEM ^ pem_time_before_conj + PEM ^ pem_time_after_conj,
-    Time = InnerTime + BeforeAndAfterTime.
+    ( FirstConjTime < InnerTime ->
+        FirstConjWakeupPenalty = PEM ^ pem_context_wakeup_delay
+    ;
+        FirstConjWakeupPenalty = 0.0
+    ),
+    Time = InnerTime + BeforeAndAfterTime + FirstConjWakeupPenalty.
 
 parallel_exec_metrics_get_seq_time(PEM) = Time :- 
     Inner = PEM ^ pem_inner_metrics,
-    InnerTime = parallel_exec_metrics_incomp_get_seq_time(Inner),
+    InnerTime = parallel_exec_metrics_internal_get_seq_time(Inner),
     BeforeAndAfterTime = PEM ^ pem_time_before_conj + PEM ^ pem_time_after_conj,
     Time = InnerTime + BeforeAndAfterTime.
 
@@ -854,24 +944,25 @@ parallel_exec_metrics_get_speedup(Metric
 
     % The expected parallel execution time.
     %
-:- func parallel_exec_metrics_incomp_get_par_time(
-    parallel_exec_metrics_incomplete) = float.
+:- func parallel_exec_metrics_internal_get_par_time(
+    parallel_exec_metrics_internal) = float.
 
-parallel_exec_metrics_incomp_get_par_time(pem_initial(_)) = 0.0.
-parallel_exec_metrics_incomp_get_par_time(pem_additional(MetricsLeft, _, TimeRight)) 
-        = Time :-
-    TimeLeft = parallel_exec_metrics_incomp_get_par_time(MetricsLeft),
+parallel_exec_metrics_internal_get_par_time(pem_left_most(_, Time)) = Time.
+parallel_exec_metrics_internal_get_par_time(pem_additional(MetricsLeft,
+        TimeLeftSignal, _, TimeRight)) = Time :-
+    TimeLeft = parallel_exec_metrics_internal_get_par_time(MetricsLeft) +
+        TimeLeftSignal,
     Time = max(TimeLeft, TimeRight).
 
     % The expected sequential execution time.
     %
-:- func parallel_exec_metrics_incomp_get_seq_time(
-    parallel_exec_metrics_incomplete) = float.
+:- func parallel_exec_metrics_internal_get_seq_time(
+    parallel_exec_metrics_internal) = float.
 
-parallel_exec_metrics_incomp_get_seq_time(pem_initial(_)) = 0.0.
-parallel_exec_metrics_incomp_get_seq_time(pem_additional(MetricsLeft, TimeRight, _)) 
-        = Time :-
-    TimeLeft = parallel_exec_metrics_incomp_get_seq_time(MetricsLeft),
+parallel_exec_metrics_internal_get_seq_time(pem_left_most(Time, _)) = Time.
+parallel_exec_metrics_internal_get_seq_time(pem_additional(MetricsLeft, _,
+        TimeRight, _)) = Time :-
+    TimeLeft = parallel_exec_metrics_internal_get_seq_time(MetricsLeft),
     Time = TimeLeft + TimeRight.
 
 parallel_exec_metrics_get_time_saving(Metrics) = SeqTime - ParTime :-
@@ -881,51 +972,49 @@ parallel_exec_metrics_get_time_saving(Me
 parallel_exec_metrics_get_first_conj_dead_time(Metrics) = DeadTime :-
     Inner = Metrics ^ pem_inner_metrics,
     FirstConjTime = pem_get_first_conj_time(Inner),
-    MaxConjTime = pem_get_max_conj_time(Inner, 0.0),
-    DeadTime = MaxConjTime - FirstConjTime.
+    ParTime = parallel_exec_metrics_get_par_time(Metrics),
+    DeadTime = ParTime - FirstConjTime.
 
-:- func pem_get_first_conj_time(parallel_exec_metrics_incomplete) = float.
+    % Get the parallel execution time of the first conjunct.  This is used for
+    % calculating the first conjunct's dead time (above).
+    %
+:- func pem_get_first_conj_time(parallel_exec_metrics_internal) = float.
 
-pem_get_first_conj_time(pem_initial(_)) = _ :- 
-    error("pem_get_first_conj_time: Empty conjunction").
-pem_get_first_conj_time(pem_additional(Left, _RightSeq, RightPar)) = Time :-
+pem_get_first_conj_time(pem_left_most(_, Time)) = Time.
+pem_get_first_conj_time(pem_additional(Left, LeftSignalTime0, _, _)) = Time :-
     (
-        Left = pem_initial(_),
-        Time = RightPar
+        Left = pem_left_most(_, _),
+        LeftSignalTime = LeftSignalTime0
     ;
-        Left = pem_additional(_, _, _),
-        Time = pem_get_first_conj_time(Left)
-    ).
-
-:- func pem_get_max_conj_time(parallel_exec_metrics_incomplete, float) = float.
-
-pem_get_max_conj_time(pem_initial(_), Max) = Max.
-pem_get_max_conj_time(pem_additional(Left, _, Par), Max0) = Max :-
-    Max1 = max(Par, Max0),
-    Max = pem_get_max_conj_time(Left, Max1).
+        Left = pem_additional(_, _, _, _),
+        LeftSignalTime = 0.0
+    ),
+    Time = pem_get_first_conj_time(Left) + LeftSignalTime.
 
 parallel_exec_metrics_get_future_dead_time(Metrics) = DeadTime :-
     Inner = Metrics ^ pem_inner_metrics,
     RightConjDelay = Metrics ^ pem_right_conj_delay,
-    DeadTime = pem_get_future_dead_time(Inner, RightConjDelay).
-
-:- func pem_get_future_dead_time(parallel_exec_metrics_incomplete, float) 
-    = float.
-
-pem_get_future_dead_time(pem_initial(_), _) = 0.0.
-pem_get_future_dead_time(pem_additional(Left, Seq, Par), Delay) = DeadTime :-
+    LeftConjCost = Metrics ^ pem_left_conj_cost,
+    DeadTime = pem_get_future_dead_time(Inner, yes, LeftConjCost,
+        RightConjDelay).
+
+:- func pem_get_future_dead_time(parallel_exec_metrics_internal, bool,
+    float, float) = float.
+
+    % XXX: Delays may new be build into the times.
+pem_get_future_dead_time(pem_left_most(_, _), _, _, _) = 0.0.
+pem_get_future_dead_time(pem_additional(Left, _, Seq, Par), 
+        IsRightmostConj, ForkCost, ForkDelay) = DeadTime :-
     DeadTime = ThisDeadTime + LeftDeadTime,
-    % Only use the delay if this conjunction contains some code in it's left
-    % conjunct.
+    ThisDeadTime0 = Par - Seq - ForkDelay,
     (
-        Left = pem_initial(_),
-        ThisDelay = 0.0
+        IsRightmostConj = yes,
+        ThisDeadTime = ThisDeadTime0
     ;
-        Left = pem_additional(_, _, _),
-        ThisDelay = Delay
+        IsRightmostConj = no,
+        ThisDeadTime = ThisDeadTime0 + ForkCost
     ),
-    ThisDeadTime = Par - Seq - ThisDelay,
-    LeftDeadTime = pem_get_future_dead_time(Left, Delay).
+    LeftDeadTime = pem_get_future_dead_time(Left, no, ForkCost, ForkDelay).
 
 parallel_exec_metrics_get_total_dead_time(Metrics) = DeadTime :-
     DeadTime = FirstConjDeadTime + FutureDeadTime,
@@ -934,5 +1023,11 @@ parallel_exec_metrics_get_total_dead_tim
     FutureDeadTime = parallel_exec_metrics_get_future_dead_time(Metrics).
 
 %-----------------------------------------------------------------------------%
+
+:- func this_file = string.
+
+this_file = "feedback.m: ".
+
+%-----------------------------------------------------------------------------%
 :- end_module mdbcomp.feedback.
 %-----------------------------------------------------------------------------%
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 489 bytes
Desc: Digital signature
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20100714/75d1b158/attachment.sig>


More information about the reviews mailing list