[m-rev.] For post-commit review: Update granularity control.

Paul Bone pbone at csse.unimelb.edu.au
Fri Oct 8 10:40:46 AEDT 2010
Previous message: [m-rev.] diff: make some java classes serializable
Next message: [m-rev.] diff: trivial C# foreign procs
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
For post-commit review by Zoltan.

---

Update granularity control to ensure that it works with the current runtime
system.

Granularity control now uses the length of a context's run queue (its local
spark wsdeque) as the measure of how busy the system is, and therefore of
whether it should fork off work.  It is now configured at runtime rather than
at compile time, so the --parallelism-target option has been removed from the
compiler.

Running some simple tests shows that granularity control has little effect on
most programs.  The effect is probably negligible on programs that use few,
large grains of parallelism.  On pathological cases such as parallel naive
Fibonacci, granularity control has a significant effect: parallel Fibonacci
runs roughly four times faster than sequential Fibonacci on an eight-core
machine, but ten times slower if granularity control is disabled.

Granularity control slightly improves the performance of the very-dependant and
parallelism programs.  However, the sequential versions of these programs are
faster, as there is close to zero 'parallel overlap'.

These tests were informal; more formal testing is required, especially for
tuning.

compiler/granularity.m:
    Updated granularity control to use a new macro in the runtime to test if a
    new task should be spawned.

    Use a runtime option to tune runtime granularity rather than a compile time
    option.

    Mark the runtime test as thread safe to avoid unnecessary locking.
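
    The sketch below models the intended run-time behaviour in C.  It
    assumes, as the procedure name evaluate_parallelism_condition suggests,
    that the generated test guards a choice between the parallel conjunction
    and its sequential version.  DequeModel, granularity_threshold and the
    helper functions are hypothetical stand-ins, not runtime APIs; the
    threshold of 64 is simply the default factor of 8 times eight engines.

```c
#include <assert.h>
#include <stdbool.h>

/*
** Hypothetical model of one context's spark deque: thieves advance top,
** the owner advances bottom.  Not the Mercury runtime's real types.
*/
typedef struct {
    long top;
    long bottom;
} DequeModel;

/* Hypothetical stand-in for MR_granularity_wsdeque_length (8 * 8). */
static unsigned long granularity_threshold = 64;

/* Models the foreign proc body: SUCCESS_INDICATOR = <test>; */
static bool
evaluate_parallelism_condition(const DequeModel *dq)
{
    return (dq->bottom - dq->top) < (long) granularity_threshold;
}

/*
** Models the guarded conjunction: if the test succeeds, the parallel
** version runs and pushes a spark; otherwise the conjuncts run
** sequentially.  Returns true if the parallel version was chosen.
*/
static bool
run_conjunction(DequeModel *dq)
{
    if (evaluate_parallelism_condition(dq)) {
        dq->bottom++;       /* spark the second conjunct */
        return true;
    }
    return false;           /* run both conjuncts sequentially */
}

/* Run n conjunctions with no thieves; count how many ran in parallel. */
static int
count_parallel(DequeModel *dq, int n)
{
    int parallel = 0;

    for (int i = 0; i < n; i++) {
        if (run_conjunction(dq)) {
            parallel++;
        }
    }
    return parallel;
}
```

    With no thieves taking work, 100 conjunctions in a row create 64 sparks
    and then fall back to sequential execution; once a thief takes some
    sparks, the test succeeds again.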

compiler/options.m:
    Removed --parallelism-target compilation option.  Granularity control is
    now configured at run-time.

runtime/mercury_wrapper.h:
    Create two new global variables, MR_granularity_wsdeque_length and
    MR_granularity_wsdeque_length_factor.  MR_granularity_wsdeque_length is
    MR_granularity_wsdeque_length_factor * MR_num_threads.
    MR_granularity_wsdeque_length_factor and the number of engines
    (MR_num_threads) are both configurable via the MERCURY_OPTIONS
    environment variable.

    The new test (in mercury_context.h) calculates the length of the wsdeque
    each time it is evaluated.  A comment there justifies this design.
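
    As a worked example of the relationship above (a sketch; the function
    name is hypothetical): with the default factor of 8 and eight engines
    the threshold is 8 * 8 = 64, so the test stops spawning new sparks once
    a context's wsdeque holds 64 of them.

```c
#include <assert.h>

/*
** Hypothetical helper expressing
**   MR_granularity_wsdeque_length =
**       MR_granularity_wsdeque_length_factor * MR_num_threads
*/
static unsigned long
granularity_wsdeque_length(unsigned long factor, unsigned long num_threads)
{
    assert(factor >= 1);    /* the option parser rejects factors below 1 */
    return factor * num_threads;
}
```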

runtime/mercury_wrapper.c:
    Initialise MR_granularity_wsdeque_length during startup of the runtime.

    Parse the new runtime option --runtime-granularity-wsdeque-length-factor.
    The default value for this option is 8; this value was chosen somewhat
    arbitrarily.  In the future we should test the effects of different values
    of this option.
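
    The patch parses the factor with sscanf("%lu") and calls MR_usage() for
    values below 1.  A stricter, hypothetical parser is sketched below; it
    uses strtoul so that trailing characters, negative numbers and overflow
    are also rejected.  It is an illustration only, not the code in
    mercury_wrapper.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
** Hypothetical parser for the argument of
** --runtime-granularity-wsdeque-length-factor.  Stores the value in
** *factor_out and returns true only if arg is a complete decimal number >= 1.
*/
static bool
parse_wsdeque_length_factor(const char *arg, unsigned long *factor_out)
{
    char            *end;
    unsigned long   value;

    assert(arg != NULL && factor_out != NULL);

    /* strtoul silently wraps negative input, so reject it explicitly. */
    if (strchr(arg, '-') != NULL) {
        return false;
    }

    errno = 0;
    value = strtoul(arg, &end, 10);
    if (end == arg || *end != '\0' || errno == ERANGE) {
        return false;   /* empty, trailing garbage, or out of range */
    }

    /* A factor of 0 makes the threshold 0: no spark would ever be created. */
    if (value < 1) {
        return false;
    }

    *factor_out = value;
    return true;
}
```

    On failure the caller would report a usage error, as the patch does
    with MR_usage(); on success the factor replaces the default of 8.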

runtime/mercury_context.h:
    Implement a new granularity control test that is linked to the length of a
    local thread's run queue.  The test compares the length of the queue to
    MR_granularity_wsdeque_length.

runtime/mercury_context.c:
    Re-initialise MR_granularity_wsdeque_length after auto-detection of
    MR_num_threads.
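
    The order matters because, when no engine count is given, MR_num_threads
    is only known after auto-detection.  The sketch below uses hypothetical
    names and assumes that 0 means "no -P value given"; it mirrors the
    sysconf(_SC_NPROCESSORS_ONLN) detection and the fallback to one engine
    visible in the mercury_context.c hunk, and computes the threshold only
    afterwards.

```c
#include <assert.h>
#include <unistd.h>

/*
** Hypothetical model: 'requested' is the engine count given with -P,
** or 0 if none was given, in which case it is auto-detected.
*/
static unsigned long
detect_num_threads(unsigned long requested)
{
    if (requested != 0) {
        return requested;
    }
#ifdef _SC_NPROCESSORS_ONLN
    {
        long online = sysconf(_SC_NPROCESSORS_ONLN);

        if (online > 0) {
            return (unsigned long) online;
        }
    }
#endif
    return 1;   /* fall back to a single engine */
}

/* Compute the threshold only once the engine count is final. */
static unsigned long
threshold_after_detection(unsigned long factor, unsigned long requested)
{
    unsigned long num_threads = detect_num_threads(requested);

    assert(num_threads >= 1);
    return factor * num_threads;
}
```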

runtime/mercury_wsdeque.h:
    Provide a new inline function to get the length of a wsdeque.
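
    A single-threaded sketch of why bottom - top is the deque's length: the
    owner pushes and pops at the bottom, and thieves steal from the top.
    DequeIndices and its helpers are hypothetical; they model only the two
    indices, with no spark storage, no array growth and no synchronisation.
    In the real deque a thief may advance the top concurrently, so a length
    read by the owner can be momentarily stale, which is presumably why the
    new function is documented as safe from the owner's perspective.

```c
#include <assert.h>

/* Hypothetical model of a work-stealing deque's two indices. */
typedef struct {
    long top;       /* next spark a thief would steal */
    long bottom;    /* next free slot at the owner's end */
} DequeIndices;

/* Owner pushes a spark at the bottom. */
static void
push_bottom(DequeIndices *dq)
{
    dq->bottom++;
}

/* Owner pops from the bottom; returns 0 if the deque is empty. */
static int
pop_bottom(DequeIndices *dq)
{
    if (dq->bottom == dq->top) {
        return 0;
    }
    dq->bottom--;
    return 1;
}

/* Thief steals from the top; returns 0 if the deque is empty. */
static int
steal_top(DequeIndices *dq)
{
    if (dq->bottom == dq->top) {
        return 0;
    }
    dq->top++;
    return 1;
}

/* Same computation as the new MR_wsdeque_length: bottom - top. */
static long
deque_length(const DequeIndices *dq)
{
    return dq->bottom - dq->top;
}
```

    For example, pushing three sparks, then stealing one and popping one,
    leaves a length of 1.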

doc/user_guide.texi:
    Updated documentation to reflect changes to compiler and runtime options.

    The new runtime option's documentation is commented out; it is intended
    for developers who understand its operational semantics.

Index: compiler/granularity.m
===================================================================
RCS file: /home/mercury1/repository/mercury/compiler/granularity.m,v
retrieving revision 1.17
diff -u -p -b -r1.17 granularity.m
--- compiler/granularity.m	16 Sep 2010 00:39:03 -0000	1.17
+++ compiler/granularity.m	7 Oct 2010 23:20:15 -0000
@@ -112,12 +112,8 @@ runtime_granularity_test_in_goal(Goal0, 
             ;
                 CalledSCCPredProcIds = [_ | _],
                 ProcName = "evaluate_parallelism_condition",
-                globals.lookup_int_option(Globals, parallelism_target,
-                    NumCPUs),
-                NumCPUsStr = string.int_to_string(NumCPUs),
                 Code = "SUCCESS_INDICATOR = " ++
-                    "MR_par_cond_contexts_and_global_sparks_vs_num_cpus(" ++
-                    NumCPUsStr ++ ");",
+                    "MR_par_cond_local_wsdeque_length;",
                 Args = [],
                 ExtraArgs = [],
                 MaybeRuntimeCond = no,
@@ -125,6 +121,7 @@ runtime_granularity_test_in_goal(Goal0, 
                 Context = goal_info_get_context(GoalInfo),
                 some [!Attributes] (
                     !:Attributes = default_attributes(lang_c),
+                    set_thread_safe(proc_thread_safe, !Attributes),
                     set_purity(purity_impure, !Attributes),
                     set_may_call_mercury(proc_will_not_call_mercury,
                         !Attributes),
Index: compiler/options.m
===================================================================
RCS file: /home/mercury1/repository/mercury/compiler/options.m,v
retrieving revision 1.678
diff -u -p -b -r1.678 options.m
--- compiler/options.m	6 Oct 2010 04:01:32 -0000	1.678
+++ compiler/options.m	7 Oct 2010 23:20:15 -0000
@@ -677,7 +677,6 @@
     ;       allow_some_paths_only_waits
     ;       control_granularity
     ;       distance_granularity
-    ;       parallelism_target
     ;       implicit_parallelism
     ;       old_implicit_parallelism
             % implicit_parallelism_old enables Jerome's implementation,
@@ -1547,7 +1546,6 @@ option_defaults_2(optimization_option, [
     allow_some_paths_only_waits         -   bool(yes),
     control_granularity                 -   bool(no),
     distance_granularity                -   int(0),
-    parallelism_target                  -   int(4),
     implicit_parallelism                -   bool(no),
     old_implicit_parallelism            -   bool(no),
     region_analysis                     -   bool(no),
@@ -2437,7 +2435,6 @@ long_option("allow-some-paths-only-waits
                                     allow_some_paths_only_waits).
 long_option("control-granularity",  control_granularity).
 long_option("distance-granularity", distance_granularity).
-long_option("parallelism-target",   parallelism_target).
 long_option("implicit-parallelism", implicit_parallelism).
 long_option("old-implicit-parallelism", old_implicit_parallelism).
 
@@ -4984,10 +4981,8 @@ options_help_hlds_hlds_optimization -->
 %       "\tversion of the called procedure, even if this is not profitable.",
         "--control-granularity",
         "\tDon't try to generate more parallelism than the machine can",
-        "\thandle, which is specified using --parallelism-target.",
-        "--parallelism-target <n>",
-        "\tSpecify the number of CPUs of the target machine, for use by",
-        "\tthe --control-granularity option.",
+        "\thandle, which may be specified at runtime or detected",
+        "\tautomatically.",
         "--distance-granularity <distance>",
         "\tControl the granularity of parallel execution using the",
         "\tspecified distance value.", 
Index: doc/user_guide.texi
===================================================================
RCS file: /home/mercury1/repository/mercury/doc/user_guide.texi,v
retrieving revision 1.613
diff -u -p -b -r1.613 user_guide.texi
--- doc/user_guide.texi	6 Oct 2010 04:01:32 -0000	1.613
+++ doc/user_guide.texi	7 Oct 2010 23:28:33 -0000
@@ -8733,7 +8733,9 @@ This information is used to reduce the o
 @item --control-granularity
 @findex --control-granularity
 Don't try to generate more parallelism than the machine can handle,
-which is specified using --parallelism-target.
+which may be specified at runtime or detected automatically.
+(see the @samp{-P} option in the @samp{MERCURY_OPTIONS} environment 
+variable.)
 
 @sp 1
 @item --distance-granularity @var{distance_value}
@@ -9909,6 +9911,14 @@ This only has an effect if the executabl
 grade.
 
 @c @sp 1
+@c @item --runtime-granularity-wsdeque-length-factor @var{factor}
+@c @findex --runtime-granularity-wsdeque-length-factor (runtime option)
+@c Configures the runtime granularity control method not to create sparks if a
+@c context's local spark wsdeque is longer than
+@c @math{ @var{factor} * @var{num_engines}}.
+@c @var{num_engines} is configured with the @samp{-P} runtime option.
+@c
+@c @sp 1
 @c @item --profile-parallel-execution
 @c @findex --profile-parallel-execution
 @c Tells the runtime to collect and write out parallel execution profiling
Index: runtime/mercury_context.c
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_context.c,v
retrieving revision 1.81
diff -u -p -b -r1.81 mercury_context.c
--- runtime/mercury_context.c	31 May 2010 09:41:47 -0000	1.81
+++ runtime/mercury_context.c	7 Oct 2010 23:20:15 -0000
@@ -248,6 +248,9 @@ MR_init_thread_stuff(void)
         MR_num_threads = 1;
 #endif /* ! defined(MR_HAVE_SYSCONF) && defined(_SC_NPROCESSORS_ONLN) */ 
     }
+#ifdef MR_LL_PARALLEL_CONJ
+    MR_granularity_wsdeque_length = MR_granularity_wsdeque_length_factor * MR_num_threads;
+#endif
 #endif /* MR_THREAD_SAFE */
 }
 
Index: runtime/mercury_context.h
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_context.h,v
retrieving revision 1.60
diff -u -p -b -r1.60 mercury_context.h
--- runtime/mercury_context.h	20 Mar 2010 10:15:51 -0000	1.60
+++ runtime/mercury_context.h	7 Oct 2010 23:20:15 -0000
@@ -785,6 +785,23 @@ extern  void        MR_schedule_context(
   #define MR_par_cond_contexts_and_all_sparks_vs_num_cpus(target_cpus)        \
       (MR_num_outstanding_contexts_and_all_sparks < target_cpus)
 
+  /*
+  ** This test calculates the length of a wsdeque each time it is called.
+  ** The test will usually execute more often than the length of the
+  ** queue changes.  Therefore, it makes sense to update a protected counter
+  ** each time a spark is pushed, popped or stolen from the queue.  However I
+  ** believe that these atomic operations could be more expensive than
+  ** necessary.
+  **
+  ** The current implementation computes the length of the queue each time this
+  ** macro is evaluated, this requires no atomic operations and contains only
+  ** one extra memory dereference whose cache line is probably already hot in
+  ** the first-level cache.
+  */
+  #define MR_par_cond_local_wsdeque_length                                    \
+      (MR_wsdeque_length(&MR_ENGINE(MR_eng_this_context)->MR_ctxt_spark_deque) < \
+        MR_granularity_wsdeque_length)
+
 extern MR_Code* 
 MR_do_join_and_continue(MR_SyncTerm *sync_term, MR_Code *join_label);
 
Index: runtime/mercury_wrapper.c
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_wrapper.c,v
retrieving revision 1.209
diff -u -p -b -r1.209 mercury_wrapper.c
--- runtime/mercury_wrapper.c	26 May 2010 07:45:49 -0000	1.209
+++ runtime/mercury_wrapper.c	7 Oct 2010 23:20:15 -0000
@@ -309,6 +309,11 @@ static  int         MR_num_output_args =
 */
 MR_Unsigned         MR_num_threads = 0;
 
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ)
+MR_Unsigned         MR_granularity_wsdeque_length_factor = 8;
+MR_Unsigned         MR_granularity_wsdeque_length = 0;
+#endif
+
 static  MR_bool     MR_print_table_statistics = MR_FALSE;
 
 /* timing */
@@ -602,6 +607,9 @@ mercury_runtime_init(int argc, char **ar
     /* MR_init_thread_stuff() must be called prior to MR_init_memory() */
     MR_init_thread_stuff();
     MR_max_outstanding_contexts = MR_max_contexts_per_thread * MR_num_threads;
+#ifdef MR_LL_PARALLEL_CONJ
+    MR_granularity_wsdeque_length = MR_granularity_wsdeque_length_factor * MR_num_threads;
+#endif
     MR_primordial_thread = pthread_self();
 #endif
 
@@ -1266,6 +1274,7 @@ enum MR_long_option {
     MR_GEN_NONDETSTACK_REDZONE_SIZE,
     MR_GEN_NONDETSTACK_REDZONE_SIZE_KWORDS,
     MR_MAX_CONTEXTS_PER_THREAD,
+    MR_RUNTIME_GRANULAITY_WSDEQUE_LENGTH_FACTOR,
     MR_WORKSTEAL_MAX_ATTEMPTS,
     MR_WORKSTEAL_SLEEP_MSECS,
     MR_THREAD_PINNING,
@@ -1367,6 +1376,8 @@ struct MR_option MR_long_opts[] = {
     { "gen-nondetstack-zone-size-kwords",
         1, 0, MR_GEN_NONDETSTACK_REDZONE_SIZE_KWORDS },
     { "max-contexts-per-thread",        1, 0, MR_MAX_CONTEXTS_PER_THREAD },
+    { "runtime-granularity-wsdeque-length-factor", 1, 0, 
+        MR_RUNTIME_GRANULAITY_WSDEQUE_LENGTH_FACTOR },
     { "worksteal-max-attempts",         1, 0, MR_WORKSTEAL_MAX_ATTEMPTS },
     { "worksteal-max-attempts",         1, 0, MR_WORKSTEAL_SLEEP_MSECS },
     { "thread-pinning",                 0, 0, MR_THREAD_PINNING },
@@ -1784,6 +1795,19 @@ MR_process_options(int argc, char **argv
                 MR_max_contexts_per_thread = size;
                 break;
 
+            case MR_RUNTIME_GRANULAITY_WSDEQUE_LENGTH_FACTOR:
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ)
+                if (sscanf(MR_optarg, "%lu",
+                        &MR_granularity_wsdeque_length_factor) != 1)
+                {
+                    MR_usage();
+                }
+                if (MR_granularity_wsdeque_length_factor < 1) {
+                    MR_usage();
+                }
+#endif
+                break;
+
             case MR_WORKSTEAL_MAX_ATTEMPTS:
 #ifdef MR_LL_PARALLEL_CONJ
                 if (sscanf(MR_optarg, "%lu", &MR_worksteal_max_attempts) != 1) {
Index: runtime/mercury_wrapper.h
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_wrapper.h,v
retrieving revision 1.83
diff -u -p -b -r1.83 mercury_wrapper.h
--- runtime/mercury_wrapper.h	15 Dec 2009 02:29:07 -0000	1.83
+++ runtime/mercury_wrapper.h	7 Oct 2010 23:20:15 -0000
@@ -262,6 +262,21 @@ extern	MR_Unsigned MR_worksteal_sleep_ms
 
 extern  MR_Unsigned MR_num_threads;
 
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ)
+/*
+** This is used to set MR_granularity_wsdeque_length based on the value of
+** MR_num_threads.  A value of 2 says: allow twice as many sparks in a
+** context's wsdeque as there are Mercury engines before granularity
+** control has an effect.
+*/
+extern MR_Unsigned MR_granularity_wsdeque_length_factor;
+
+/*
+** The length of a context's wsdeque before granularity control has an effect.
+*/
+extern MR_Unsigned MR_granularity_wsdeque_length;
+#endif
+
 /* file names for the mdb debugging streams */
 extern	const char	*MR_mdb_in_filename;
 extern	const char	*MR_mdb_out_filename;
Index: runtime/mercury_wsdeque.h
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_wsdeque.h,v
retrieving revision 1.3
diff -u -p -b -r1.3 mercury_wsdeque.h
--- runtime/mercury_wsdeque.h	20 Mar 2010 10:15:51 -0000	1.3
+++ runtime/mercury_wsdeque.h	7 Oct 2010 23:20:15 -0000
@@ -97,6 +97,14 @@ extern  int     MR_wsdeque_take_top(MR_S
 extern  MR_SparkArray * MR_grow_spark_array(const MR_SparkArray *old_arr,
                             MR_Integer bot, MR_Integer top);
 
+/*
+** Return the current length of the deque.
+**
+** This is safe from the owner's perspective.
+*/
+MR_INLINE int
+MR_wsdeque_length(MR_SparkDeque *dq);
+
 /*---------------------------------------------------------------------------*/
 
 MR_INLINE void
@@ -154,6 +162,20 @@ MR_wsdeque_pop_bottom(MR_SparkDeque *dq,
     return success;
 }
 
+MR_INLINE int
+MR_wsdeque_length(MR_SparkDeque *dq)
+{
+    int length;
+    int top;
+    int bot;
+
+    top = dq->MR_sd_top;
+    bot = dq->MR_sd_bottom;
+    length = bot - top;
+
+    return length;
+}
+
 #endif /* !MR_LL_PARALLEL_CONJ */
 
 #endif /* !MERCURY_WSDEQUE_H */