[m-rev.] diff: Document a bug in the parallel runtime system.

Paul Bone pbone at csse.unimelb.edu.au
Tue Sep 4 21:04:30 AEST 2012

Document a bug in the parallel runtime system.

I've found the cause of a long standing race condition in the parallel
runtime system.  The fix is to properly design the engine state and
notification system in the parallel runtime system.  I've worked on a design
and committed a description.  I've started implementing the changes but found
a new different race condition.  I don't currently have the time to complete
this change, therefore I will commit the documentation of the current
problem and proposed solution.

    As above, document the current race.

    Also comment on a bug where an idle engine could be notified of a
    context to execute.  This notification could be dropped in some
    situations causing the context to be lost.

    As above, document the proposed new engine state and notification

diff --git a/runtime/mercury_context.c b/runtime/mercury_context.c
index 605f54c..a736620 100644
--- a/runtime/mercury_context.c
+++ b/runtime/mercury_context.c
@@ -1505,6 +1505,38 @@ MR_schedule_context(MR_Context *ctxt)
     if (ctxt->MR_ctxt_resume_engine_required == MR_TRUE) {
         ** Only engine_id may execute this context, attempt to wake it.
+        **
+        ** Note that there is a race condition here.  If the engine that can
+        ** run this context is working, then try_wake_engine() will fail, if
+        ** it then becomes idle and checks the run queue before we acquire
+        ** the run queue lock below then it can go to sleep and won't be
+        ** notified that there is a context to execute.  The context will be
+        ** placed on the run queue awaiting the engine.  If the context can
+        ** only be executed by a single engine, then that engine will only
+        ** check the run queue if it first executes a spark, causing it to
+        ** call MR_do_idle after completing the spark.
+        **
+        ** This is only a problem for contexts that can only be executed on
+        ** a single engine.  In other causes this engine is guaranteed to
+        ** eventually call MR_do_idle and execute the context.  Potentially
+        ** causing a loss of parallelism but not a deadlock.
+        **
+        ** We can fix this race by adding an extra message, which we
+        ** tentatively call MR_ENGINE_ACTION_CONTEXT_ADVICE, which does not
+        ** contain a context but tells an engine that one is available.
+        ** After placing a context on the run queue we can deliver this
+        ** message to an idle engine that should check the run queue if it
+        ** hasn't already.  We must also guarantee that an engine checks if
+        ** it has any notifications before going into the sleeping state.
+        **
+        ** I have a workspace in which I fix these problems, however it is
+        ** buggy in other ways so I cannot commit it yet.  For now I'm
+        ** documenting it in this comment.
+        **
+        ** See runtime/design/par_engine_state.{txt,dot} for details of the
+        ** proposed changes to the engine notification code.  Although the
+        ** proposed changes in these files are much more complex than is
+        ** strictly needed, we believe that they avoid other problems.
         if (MR_debug_threads) {
@@ -1512,6 +1544,15 @@ MR_schedule_context(MR_Context *ctxt)
+        /*
+        ** There is a bug on this line, we should never give a context
+        ** notification to an idle engine as the notification can be lost if
+        ** the engine writes-over its own state and then a later
+        ** notification is passed to the engine.
+        **
+        ** However I don't want to fix it as fixing it may make the window
+        ** for the race condition documented above wider.
+        */
         if (try_wake_engine(engine_id, MR_ENGINE_ACTION_CONTEXT,
             &wake_action_data, ENGINE_STATE_IDLE | ENGINE_STATE_SLEEPING))
diff --git a/runtime/notes/engine_state.dot b/runtime/notes/engine_state.dot
new file mode 100644
index 0000000..198fb5a
--- /dev/null
+++ b/runtime/notes/engine_state.dot
@@ -0,0 +1,103 @@
+digraph G {
+//    rankdir=LR;
+    { rank=source; startup };
+    { rank=same; context; context_advice; worksteal_advice };
+    //{ rank=same; stealing; sleeping; };
+    { rank=sink; do_shutdown };
+    { rank=same; working; idle; stealing; };
+    // { rank=same; working; idle; sleeping; };
+    //subgraph cluster_working {
+        //working;
+        //idle;
+        context_advice [label="context or worksteal advice" shape=rectangle];
+        context_advice2 [label="context_advice" shape=rectangle];
+        context [shape=rectangle];
+        //worksteal [shape=rectangle];
+    //    label = "counted as working";
+    //}
+    subgraph cluster_idle {
+        stealing;
+        sleeping;
+        soft_was_stealing [label="context or worksteal advice" shape=rectangle];
+        label = "counted as idle";
+    };
+    // Styles for 'idle' states.
+    sleeping [style=filled fillcolor=lightgrey];
+    stealing [style=filled fillcolor=lightgrey];
+    // Shapes for notifications
+    shutdown [shape=rectangle];
+    context [shape=rectangle];
+    context_advice [shape=rectangle];
+    worksteal_advice [shape=rectangle];
+    // Illistrative only.
+    worksteal_once [style=dashed];
+    /*
+     * Normal operation.
+     */
+    // Primordial engine.
+    startup -> working;
+    // Worker engines.
+    startup -> idle;
+    // Finished work.
+    working -> idle [style=bold];
+    // Looking for work.
+    idle -> working [style=bold];
+    stealing -> working [style=bold];
+    // Other transitions use CAS below.
+    // Respond to notifications.
+    shutdown -> do_shutdown;
+    // We do the same action as idle does anyway.
+    context_advice -> idle;
+    worksteal_advice -> worksteal_once;
+    worksteal_once -> working;
+    worksteal_once -> stealing;
+    context -> working;
+    /* CAS Transitions */
+    edge [color=blue];
+    idle -> stealing [style=bold];
+    stealing -> sleeping;
+    /* Locked transitions from other engines */
+    edge [color=red];
+    sleeping -> context;
+    sleeping -> context_advice;
+    sleeping -> worksteal_advice;
+    sleeping -> shutdown;
+    /* CAS transitions from other engines */
+    edge [color=green];
+    idle -> context_advice;
+    idle -> worksteal_advice;
+    stealing -> context_advice;
+    stealing -> worksteal_advice;
+    // safe because there will be no other work
+    idle -> shutdown;
+    stealing -> shutdown;
diff --git a/runtime/notes/par_engine_state.txt b/runtime/notes/par_engine_state.txt
new file mode 100644
index 0000000..e32d0e0
--- /dev/null
+++ b/runtime/notes/par_engine_state.txt
@@ -0,0 +1,120 @@
+THere is one definite problem with the current implementation and several
+potential ones.  I'm in the process of testing the following proposal.
+Engine states and notifications
+An engine may be in one of the following states, see the es_state field
+working      The engine has work to do and is working on it.
+             The engine will not check for notifications, all
+             notifications will be ignored.
+idle         The engine finished its work and is looking for
+             more work.  It is looking for a context to resume or a local
+             spark.  If found, the engine will move to the working state,
+             if not, it will check for notifications and if there are
+             none it moves to the stealing state.  Only notify an idle
+             engine with notifications that may be ignored.
+stealing     The engine is now attempting to work steal.  It has now
+             incremented the idle engine count to make it easier to
+             receive notifications.  If it finds a spark it will decrement
+             the count and execute the spark.  Otherwise it checks for
+             notifications and moves to the sleeping state.  This state
+             is similar to idle but separate as it allows another engine
+             to understand if this engine has modified the idle engine
+             count (which we don't want to do in the idle state as that
+             will often find a local spark to execute).
+sleeping     The engine has committed to going to sleep, to wake it up
+             one must post to its sleep semaphore ensuring that it does
+             not sleep.  Any notification can be sent at this stage as
+             all will be acted upon, including the context notification
+             which cannot be dropped.
+             The engine has received a notification, it cannot receive
+             another notification now.  This state is initiated by the
+             notifier, and therefore is done with either a compare and
+             swap or a lock depending on the state of the engine.  See
+             try_wake_engine and try_notify_engine.  Upon receiving the
+             notification the engine will set its new status
+             appropriately.
+An engine can move itself through the following transitions of states
+without locking or other protection.
+working -> idle
+idle -> working
+stealing -> working
+                     As the engine starts and finishes work it moves
+                     between these states with a minimum of overhead.
+                     These transitions may be made without a CAS or
+                     locking.  We simply use write ordering to guarantee
+                     that the new state (such as idle) is visible before
+                     the engine acquires the runqueue lock.
+notified -> working An engine wakes up, and finds work.
+notified -> idle
+notified -> stealing
+                     An engine wakes up but doesn't find work, it goes
+                     idle and checks for global work.
+An engine can move itself through the following transitions provided that
+it uses a CAS to do so.  This is so that it is guaranteed to observe the
+notified state if another engine has set that state.
+idle -> stealing
+                     About to attempt work stealing.
+stealing -> sleeping
+                     The engine is about to call sem_wait, and MUST call
+                     sem_wait after advertising the sleeping state.
+A notifier may notify another engine with the following transitions.
+sleeping -> notified
+                     Wake an engine while holding the wake_lock, the
+                     engine must also post to the sleep semaphore and
+                     decrement the idle engines count.  See
+                     try_wake_engine.
+idle -> notified
+stealing -> notified
+                     Notify an engine of an event.  This must use a
+                     CAS, so that it coordinates with the engine's own
+                     CAS transitions.
+See also par_engine_state.dot for a graph of these states and transitions.
+The RTS can run in a polling mode where sem_timedwait is used instead of
+sem_wait,  Define MR_WORKSTEAL_POLLING to enable this, it must also be
+defined when compiling the application as the mercury_context.h will
+define different macros.  When running in this mode an engine may move
+from the sleeping to working states itself by using the lock in its sleep
+This has been setup specifically to ensure that engines can be notified
+individually and that work is never lost.  Any future changes must
+continue to prevent the following races.
+There are two engines, A becomes idle and checks for contexts to
+execute, then B schedules a new context.  He cannot give it directly to A
+because A is not sleeping.  So A never sees the context.  Worse
+still, if this context can only be executed by A then B never continues
+this work and the whole system deadlocks.  Therefore after placing the
+context on the runqueue, a context advice message is given to any
+engine that is in the idle, stealing or sleeping states (currently only
+for cases where the context may only be executed by a single engine).
+Similarly, if engine A is creating a spark, and engine B is in the
+stealing state may have already checked A's deque.  So it's a good idea
+to notify an engine of a spark if it is in the stealing or sleeping
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 490 bytes
Desc: Digital signature
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20120904/bac9e035/attachment.sig>

More information about the reviews mailing list