[m-rev.] diff: Document a bug in the parallel runtime system.
Paul Bone
pbone at csse.unimelb.edu.au
Tue Sep 4 21:04:30 AEST 2012
Document a bug in the parallel runtime system.
I've found the cause of a long standing race condition in the parallel
runtime system. The fix is to properly design the engine state and
notification system in the parallel runtime system. I've worked on a design
and committed a description. I've started implementing the changes but found
a new different race condition. I don't currently have the time to complete
this change, therefore I will commit the documentation of the current
problem and proposed solution.
runtime/mercury_context.c:
As above, document the current race.
Also comment on a bug where an idle engine could be notified of a
context to execute. This notification could be dropped in some
situations causing the context to be lost.
runtime/notes/par_engine_state.txt:
runtime/notes/par_engine_state.dot:
As above, document the proposed new engine state and notification
design.
diff --git a/runtime/mercury_context.c b/runtime/mercury_context.c
index 605f54c..a736620 100644
--- a/runtime/mercury_context.c
+++ b/runtime/mercury_context.c
@@ -1505,6 +1505,38 @@ MR_schedule_context(MR_Context *ctxt)
if (ctxt->MR_ctxt_resume_engine_required == MR_TRUE) {
/*
** Only engine_id may execute this context, attempt to wake it.
+ **
+ ** Note that there is a race condition here. If the engine that can
+ ** run this context is working, then try_wake_engine() will fail, if
+ ** it then becomes idle and checks the run queue before we acquire
+ ** the run queue lock below then it can go to sleep and won't be
+ ** notified that there is a context to execute. The context will be
+ ** placed on the run queue awaiting the engine. If the context can
+ ** only be executed by a single engine, then that engine will only
+ ** check the run queue if it first executes a spark, causing it to
+ ** call MR_do_idle after completing the spark.
+ **
+ ** This is only a problem for contexts that can only be executed on
+ ** a single engine. In other causes this engine is guaranteed to
+ ** eventually call MR_do_idle and execute the context. Potentially
+ ** causing a loss of parallelism but not a deadlock.
+ **
+ ** We can fix this race by adding an extra message, which we
+ ** tentatively call MR_ENGINE_ACTION_CONTEXT_ADVICE, which does not
+ ** contain a context but tells an engine that one is available.
+ ** After placing a context on the run queue we can deliver this
+ ** message to an idle engine that should check the run queue if it
+ ** hasn't already. We must also guarantee that an engine checks if
+ ** it has any notifications before going into the sleeping state.
+ **
+ ** I have a workspace in which I fix these problems, however it is
+ ** buggy in other ways so I cannot commit it yet. For now I'm
+ ** documenting it in this comment.
+ **
+ ** See runtime/design/par_engine_state.{txt,dot} for details of the
+ ** proposed changes to the engine notification code. Although the
+ ** proposed changes in these files are much more complex than is
+ ** strictly needed, we believe that they avoid other problems.
*/
#ifdef MR_DEBUG_THREADS
if (MR_debug_threads) {
@@ -1512,6 +1544,15 @@ MR_schedule_context(MR_Context *ctxt)
MR_SELF_THREAD_ID);
}
#endif
+ /*
+ ** There is a bug on this line, we should never give a context
+ ** notification to an idle engine as the notification can be lost if
+ ** the engine writes-over its own state and then a later
+ ** notification is passed to the engine.
+ **
+ ** However I don't want to fix it as fixing it may make the window
+ ** for the race condition documented above wider.
+ */
if (try_wake_engine(engine_id, MR_ENGINE_ACTION_CONTEXT,
&wake_action_data, ENGINE_STATE_IDLE | ENGINE_STATE_SLEEPING))
{
diff --git a/runtime/notes/engine_state.dot b/runtime/notes/engine_state.dot
new file mode 100644
index 0000000..198fb5a
--- /dev/null
+++ b/runtime/notes/engine_state.dot
@@ -0,0 +1,103 @@
+digraph G {
+// rankdir=LR;
+
+ { rank=source; startup };
+ { rank=same; context; context_advice; worksteal_advice };
+ //{ rank=same; stealing; sleeping; };
+ { rank=sink; do_shutdown };
+ { rank=same; working; idle; stealing; };
+
+ // { rank=same; working; idle; sleeping; };
+
+/*
+ //subgraph cluster_working {
+
+ //working;
+ //idle;
+ context_advice [label="context or worksteal advice" shape=rectangle];
+ context_advice2 [label="context_advice" shape=rectangle];
+ context [shape=rectangle];
+ //worksteal [shape=rectangle];
+ // label = "counted as working";
+ //}
+*/
+
+/*
+ subgraph cluster_idle {
+
+ stealing;
+ sleeping;
+ soft_was_stealing [label="context or worksteal advice" shape=rectangle];
+ label = "counted as idle";
+ };
+*/
+
+ // Styles for 'idle' states.
+ sleeping [style=filled fillcolor=lightgrey];
+ stealing [style=filled fillcolor=lightgrey];
+
+ // Shapes for notifications
+ shutdown [shape=rectangle];
+ context [shape=rectangle];
+ context_advice [shape=rectangle];
+ worksteal_advice [shape=rectangle];
+
+ // Illistrative only.
+ worksteal_once [style=dashed];
+
+ /*
+ * Normal operation.
+ */
+
+ // Primordial engine.
+ startup -> working;
+
+ // Worker engines.
+ startup -> idle;
+
+ // Finished work.
+ working -> idle [style=bold];
+
+ // Looking for work.
+ idle -> working [style=bold];
+ stealing -> working [style=bold];
+ // Other transitions use CAS below.
+
+ // Respond to notifications.
+ shutdown -> do_shutdown;
+
+ // We do the same action as idle does anyway.
+ context_advice -> idle;
+
+ worksteal_advice -> worksteal_once;
+
+ worksteal_once -> working;
+ worksteal_once -> stealing;
+
+ context -> working;
+
+ /* CAS Transitions */
+ edge [color=blue];
+ idle -> stealing [style=bold];
+ stealing -> sleeping;
+
+ /* Locked transitions from other engines */
+ edge [color=red];
+ sleeping -> context;
+ sleeping -> context_advice;
+ sleeping -> worksteal_advice;
+ sleeping -> shutdown;
+
+ /* CAS transitions from other engines */
+ edge [color=green];
+ idle -> context_advice;
+ idle -> worksteal_advice;
+ stealing -> context_advice;
+ stealing -> worksteal_advice;
+
+ // safe because there will be no other work
+ idle -> shutdown;
+ stealing -> shutdown;
+
+}
+
diff --git a/runtime/notes/par_engine_state.txt b/runtime/notes/par_engine_state.txt
new file mode 100644
index 0000000..e32d0e0
--- /dev/null
+++ b/runtime/notes/par_engine_state.txt
@@ -0,0 +1,120 @@
+
+THere is one definite problem with the current implementation and several
+potential ones. I'm in the process of testing the following proposal.
+
+
+Engine states and notifications
+-------------------------------
+
+An engine may be in one of the following states, see the es_state field
+engine_sleep_sync_i
+
+working The engine has work to do and is working on it.
+ The engine will not check for notifications, all
+ notifications will be ignored.
+
+idle The engine finished its work and is looking for
+ more work. It is looking for a context to resume or a local
+ spark. If found, the engine will move to the working state,
+ if not, it will check for notifications and if there are
+ none it moves to the stealing state. Only notify an idle
+ engine with notifications that may be ignored.
+
+stealing The engine is now attempting to work steal. It has now
+ incremented the idle engine count to make it easier to
+ receive notifications. If it finds a spark it will decrement
+ the count and execute the spark. Otherwise it checks for
+ notifications and moves to the sleeping state. This state
+ is similar to idle but separate as it allows another engine
+ to understand if this engine has modified the idle engine
+ count (which we don't want to do in the idle state as that
+ will often find a local spark to execute).
+
+sleeping The engine has committed to going to sleep, to wake it up
+ one must post to its sleep semaphore ensuring that it does
+ not sleep. Any notification can be sent at this stage as
+ all will be acted upon, including the context notification
+ which cannot be dropped.
+
+notified
+ The engine has received a notification, it cannot receive
+ another notification now. This state is initiated by the
+ notifier, and therefore is done with either a compare and
+ swap or a lock depending on the state of the engine. See
+ try_wake_engine and try_notify_engine. Upon receiving the
+ notification the engine will set its new status
+ appropriately.
+
+An engine can move itself through the following transitions of states
+without locking or other protection.
+
+working -> idle
+idle -> working
+stealing -> working
+ As the engine starts and finishes work it moves
+ between these states with a minimum of overhead.
+ These transitions may be made without a CAS or
+ locking. We simply use write ordering to guarantee
+ that the new state (such as idle) is visible before
+ the engine acquires the runqueue lock.
+
+notified -> working An engine wakes up, and finds work.
+
+notified -> idle
+notified -> stealing
+ An engine wakes up but doesn't find work, it goes
+ idle and checks for global work.
+
+An engine can move itself through the following transitions provided that
+it uses a CAS to do so. This is so that it is guaranteed to observe the
+notified state if another engine has set that state.
+
+idle -> stealing
+ About to attempt work stealing.
+
+stealing -> sleeping
+ The engine is about to call sem_wait, and MUST call
+ sem_wait after advertising the sleeping state.
+
+A notifier may notify another engine with the following transitions.
+
+sleeping -> notified
+ Wake an engine while holding the wake_lock, the
+ engine must also post to the sleep semaphore and
+ decrement the idle engines count. See
+ try_wake_engine.
+
+idle -> notified
+stealing -> notified
+ Notify an engine of an event. This must use a
+ CAS, so that it coordinates with the engine's own
+ CAS transitions.
+
+See also par_engine_state.dot for a graph of these states and transitions.
+
+The RTS can run in a polling mode where sem_timedwait is used instead of
+sem_wait, Define MR_WORKSTEAL_POLLING to enable this, it must also be
+defined when compiling the application as the mercury_context.h will
+define different macros. When running in this mode an engine may move
+from the sleeping to working states itself by using the lock in its sleep
+structure.
+
+This has been setup specifically to ensure that engines can be notified
+individually and that work is never lost. Any future changes must
+continue to prevent the following races.
+
+There are two engines, A becomes idle and checks for contexts to
+execute, then B schedules a new context. He cannot give it directly to A
+because A is not sleeping. So A never sees the context. Worse
+still, if this context can only be executed by A then B never continues
+this work and the whole system deadlocks. Therefore after placing the
+context on the runqueue, a context advice message is given to any
+engine that is in the idle, stealing or sleeping states (currently only
+for cases where the context may only be executed by a single engine).
+
+Similarly, if engine A is creating a spark, and engine B is in the
+stealing state may have already checked A's deque. So it's a good idea
+to notify an engine of a spark if it is in the stealing or sleeping
+states.
+
+
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 490 bytes
Desc: Digital signature
URL: <http://lists.mercurylang.org/archives/reviews/attachments/20120904/bac9e035/attachment.sig>
More information about the reviews
mailing list