[m-rev.] For review: Implement support for threadscope profiling.
Paul Bone
pbone at csse.unimelb.edu.au
Tue Dec 1 17:54:21 AEDT 2009
For review by anyone.
Support for threadscope profiling of the parallel runtime.
This change adds support for threadscope profiling of the parallel runtime in
low level C grades. It can be enabled by compiling _all_ code with the
MR_PROFILE_PARALLEL_EXECUTION_SUPPORT C macro defined. The runtime, libraries
and applications must all have this flag defined, as it alters the MercuryEngine
and MR_Context structures.
See Don Jones Jr., Simon Marlow and Satnam Singh, "Parallel Performance Tuning
for Haskell".
This change also includes:
Smarter thread pinning (the primordial thread is pinned to the CPU that
it is currently running on).
The addition of callbacks from the Boehm GC to notify the runtime of
stop the world garbage collections.
Implement some userspace spin loops and condition variables. These are
cheaper than their POSIX equivalents, do not support sleeping, and are
signal-handler safe.
boehm_gc/alloc.h:
boehm_gc/alloc.c:
Declare and define the new callback functions.
boehm_gc/alloc.c:
Call the start and stop collect callbacks when we start and stop a
stop-the-world collection.
Correct how we record the time spent collecting; it now includes
collections that stop prematurely.
boehm_gc/pthread_stop_world.c:
Call the pause and resume thread callbacks in each thread where the GC
arranges for that thread to be stopped during a stop-the-world collection.
runtime/mercury_threadscope.c:
runtime/mercury_threadscope.h:
New files implementing the threadscope support.
runtime/mercury_atomic_ops.c:
runtime/mercury_atomic_ops.h:
Rename MR_configure_profiling_timers to MR_do_cpu_feature_detection.
Add a new function, MR_read_cpu_tsc(), to read the TSC register from the
CPU; this simply abstracts the static MR_rdtsc function.
runtime/mercury_atomic_ops.h:
Modify the C inline assembler to tell the C compiler that the value in
the register mapped to the 'old' parameter is also an output of the
instructions. That is, the C compiler must not assume that the value of
'old' is the same before and after the instruction is executed. This
has never been a problem in practice, though.
Implement some cheap userspace mutual exclusion locks and condition
variables. These will be faster than pthread's mutexes when critical
sections are short and threads are pinned to separate CPUs.
runtime/mercury_context.c:
runtime/mercury_context.h:
Add a new function for pinning the primordial thread. If the OS supports
sched_getcpu, we use it to determine which CPU the primordial thread is
currently running on. No other thread will be pinned to this CPU.
Add a numeric id field to each context; this id is uniquely assigned and
identifies each context for threadscope.
Move MR_load_context from a C macro to a C procedure.
MR_schedule_context posts the 'context runnable' threadscope event.
MR_do_runnext has been modified to destroy engines differently: it ensures
that they clean up properly, so that their threadscope events are flushed,
and then calls pthread_exit(0).
MR_do_runnext posts events for threadscope.
MR_do_join_and_continue posts events for threadscope.
runtime/mercury_engine.h:
Add new fields to the MercuryEngine structure, including a buffer of
threadscope events, a clock offset (used to synchronize the TSC clocks),
and a unique identifier for the engine.
runtime/mercury_engine.c:
Call MR_threadscope_setup_engine() and MR_threadscope_finalize_engine()
for newly created and about-to-be-destroyed engines respectively.
When the main context finishes on a thread that's not the primordial
thread, post a 'context is yielding' message before re-scheduling the
context on the primordial thread.
runtime/mercury_thread.c:
Added an XXX comment about a potential problem; it's only relevant for
programs using thread.spawn.
Added calls to the TSC synchronisation code used for threadscope profiling.
This appears to be unnecessary on modern x86 machines, so it has been
commented out.
Post a threadscope event when we create a new context.
Don't call pthread_exit in MR_destroy_thread, we now do this in
MR_do_runnext so that we can unlock the runqueue mutex after cleaning up.
runtime/mercury_wrapper.c:
Conform to changes in mercury_atomic_ops.[ch].
Post an event immediately before calling main to mark the beginning of the
program in the threadscope profile.
Post a "context finished" event at the end of the program.
Wait until all engines have exited before cleaning up global data; this is
important for finishing writing the threadscope data file.
configure.in:
runtime/mercury_conf.h.in:
Test for the sched_getcpu C function and the utmpx.h header file; these
are used for thread pinning.
runtime/Mmakefile:
Include the mercury_threadscope.[hc] files in the list of runtime headers
and sources respectively.
Index: configure.in
===================================================================
RCS file: /home/mercury1/repository/mercury/configure.in,v
retrieving revision 1.554
diff -u -p -b -r1.554 configure.in
--- configure.in 24 Nov 2009 23:49:44 -0000 1.554
+++ configure.in 1 Dec 2009 06:12:07 -0000
@@ -1153,7 +1153,8 @@ mercury_check_for_functions \
getpid setpgid fork execlp wait kill \
grantpt unlockpt ptsname tcgetattr tcsetattr ioctl \
access sleep opendir readdir closedir mkdir symlink readlink \
- gettimeofday setenv putenv _putenv posix_spawn sched_setaffinity
+ gettimeofday setenv putenv _putenv posix_spawn sched_setaffinity \
+ sched_getcpu
#-----------------------------------------------------------------------------#
@@ -1163,7 +1164,7 @@ MERCURY_CHECK_FOR_HEADERS( \
sys/types.h sys/stat.h fcntl.h termios.h sys/ioctl.h \
sys/stropts.h windows.h dirent.h getopt.h malloc.h \
semaphore.h pthread.h time.h spawn.h fenv.h sys/mman.h sys/sem.h \
- sched.h)
+ sched.h utmpx.h)
if test "$MR_HAVE_GETOPT_H" = 1; then
GETOPT_H_AVAILABLE=yes
Index: boehm_gc/alloc.c
===================================================================
RCS file: /home/mercury1/repository/mercury/boehm_gc/alloc.c,v
retrieving revision 1.18
diff -u -p -b -r1.18 alloc.c
--- boehm_gc/alloc.c 18 Mar 2008 03:09:39 -0000 1.18
+++ boehm_gc/alloc.c 30 Nov 2009 10:34:15 -0000
@@ -65,6 +65,13 @@ GC_bool GC_mercury_calc_gc_time
unsigned long GC_total_gc_time = 0;
/* Measured in milliseconds. */
+void (*GC_mercury_callback_start_collect)(void) = NULL;
+void (*GC_mercury_callback_stop_collect)(void) = NULL;
+void (*GC_mercury_callback_pause_thread)(void) = NULL;
+void (*GC_mercury_callback_resume_thread)(void) = NULL;
+ /* Callbacks for mercury to notify */
+ /* the runtime of certain events */
+
#ifndef SMALL_CONFIG
int GC_incremental = 0; /* By default, stop the world. */
#endif
@@ -311,6 +318,7 @@ void GC_maybe_gc(void)
GC_bool GC_try_to_collect_inner(GC_stop_func stop_func)
{
CLOCK_TYPE start_time, current_time;
+ GC_bool result = TRUE;
if (GC_dont_gc) return FALSE;
if (GC_incremental && GC_collection_in_progress()) {
if (GC_print_stats) {
@@ -351,6 +359,9 @@ GC_bool GC_try_to_collect_inner(GC_stop_
GC_save_callers(GC_last_stack);
# endif
GC_is_full_gc = TRUE;
+ if (GC_mercury_callback_start_collect) {
+ GC_mercury_callback_start_collect();
+ }
if (!GC_stopped_mark(stop_func)) {
if (!GC_incremental) {
/* We're partially done and have no way to complete or use */
@@ -360,14 +371,15 @@ GC_bool GC_try_to_collect_inner(GC_stop_
GC_unpromote_black_lists();
} /* else we claim the world is already still consistent. We'll */
/* finish incrementally. */
- return(FALSE);
- }
+ result = FALSE;
+ } else {
GC_finish_collection();
+ }
if (GC_print_stats || GC_mercury_calc_gc_time) {
unsigned long cur_gc_time;
GET_TIME(current_time);
cur_gc_time = MS_TIME_DIFF(current_time,start_time);
- if (GC_print_stats) {
+ if (GC_print_stats && result) {
GC_log_printf("Complete collection took %lu msecs\n",
cur_gc_time);
}
@@ -375,7 +387,10 @@ GC_bool GC_try_to_collect_inner(GC_stop_
GC_total_gc_time += cur_gc_time;
}
}
- return(TRUE);
+ if (GC_mercury_callback_stop_collect) {
+ GC_mercury_callback_stop_collect();
+ }
+ return(result);
}
Index: boehm_gc/pthread_stop_world.c
===================================================================
RCS file: /home/mercury1/repository/mercury/boehm_gc/pthread_stop_world.c,v
retrieving revision 1.4
diff -u -p -b -r1.4 pthread_stop_world.c
--- boehm_gc/pthread_stop_world.c 15 Aug 2006 04:19:27 -0000 1.4
+++ boehm_gc/pthread_stop_world.c 30 Nov 2009 10:34:15 -0000
@@ -119,7 +119,13 @@ void GC_suspend_handler_inner(ptr_t sig_
void GC_suspend_handler(int sig, siginfo_t *info, void *context)
{
int old_errno = errno;
+ if (GC_mercury_callback_pause_thread) {
+ GC_mercury_callback_pause_thread();
+ }
GC_with_callee_saves_pushed(GC_suspend_handler_inner, (ptr_t)(word)sig);
+ if (GC_mercury_callback_resume_thread) {
+ GC_mercury_callback_resume_thread();
+ }
errno = old_errno;
}
#else
@@ -128,7 +134,13 @@ void GC_suspend_handler(int sig, siginfo
void GC_suspend_handler(int sig, siginfo_t *info, void *context)
{
int old_errno = errno;
+ if (GC_mercury_callback_pause_thread) {
+ GC_mercury_callback_pause_thread();
+ }
GC_suspend_handler_inner((ptr_t)(word)sig, context);
+ if (GC_mercury_callback_resume_thread) {
+ GC_mercury_callback_resume_thread();
+ }
errno = old_errno;
}
#endif
Index: boehm_gc/include/gc.h
===================================================================
RCS file: /home/mercury1/repository/mercury/boehm_gc/include/gc.h,v
retrieving revision 1.19
diff -u -p -b -r1.19 gc.h
--- boehm_gc/include/gc.h 18 Mar 2008 03:09:42 -0000 1.19
+++ boehm_gc/include/gc.h 1 Dec 2009 00:25:02 -0000
@@ -244,6 +244,27 @@ GC_API unsigned long GC_total_gc_time;
/* so far by garbage collections. It is */
/* measured in milliseconds. */
+/*
+ * Callbacks to notify the Mercury runtime of certain events.
+ */
+GC_API void (*GC_mercury_callback_start_collect)(void);
+ /* Starting a collection */
+GC_API void (*GC_mercury_callback_stop_collect)(void);
+ /* Stopping a collection */
+GC_API void (*GC_mercury_callback_pause_thread)(void);
+ /*
+ * This thread is about to be paused.
+ *
+ * Use these with care! They're called from a signal
+ * handler: they must NOT allocate memory, and any locking
+ * they do must use reentrant mutexes. Also note that these
+ * do not work on OS X/Darwin: there the world is stopped
+ * in a different way, and we can't easily add support for
+ * these callbacks.
+ */
+GC_API void (*GC_mercury_callback_resume_thread)(void);
+ /* This thread is about to be resumed */
+
/* Public procedures */
/* Initialize the collector. This is only required when using thread-local
Index: runtime/Mmakefile
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/Mmakefile,v
retrieving revision 1.148
diff -u -p -b -r1.148 Mmakefile
--- runtime/Mmakefile 30 Oct 2009 03:33:27 -0000 1.148
+++ runtime/Mmakefile 30 Nov 2009 10:34:15 -0000
@@ -92,6 +92,7 @@ HDRS = \
mercury_tags.h \
mercury_term_size.h \
mercury_thread.h \
+ mercury_threadscope.h \
mercury_timing.h \
mercury_trace_base.h \
mercury_trace_term.h \
@@ -199,6 +200,7 @@ CFILES = \
mercury_tabling.c \
mercury_term_size.c \
mercury_thread.c \
+ mercury_threadscope.c \
mercury_timing.c \
mercury_trace_base.c \
mercury_trace_term.c \
Index: runtime/mercury_atomic_ops.c
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_atomic_ops.c,v
retrieving revision 1.6
diff -u -p -b -r1.6 mercury_atomic_ops.c
--- runtime/mercury_atomic_ops.c 6 Nov 2009 05:40:24 -0000 1.6
+++ runtime/mercury_atomic_ops.c 30 Nov 2009 10:34:15 -0000
@@ -116,7 +116,7 @@ parse_freq_from_x86_brand_string(char *s
#endif /* __GNUC__ && (__i386__ || __x86_64__) */
extern void
-MR_configure_profiling_timers(void) {
+MR_do_cpu_feature_detection(void) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
MR_Unsigned a, b, c, d;
MR_Unsigned eflags, old_eflags;
@@ -434,6 +434,23 @@ MR_profiling_stop_timer(MR_Timer *timer,
#endif /* not __GNUC__ && (__i386__ || __x86_64__) */
}
+MR_uint_least64_t
+MR_read_cpu_tsc(void)
+{
+#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
+ MR_uint_least64_t tsc;
+
+ if (MR_rdtsc_is_available == MR_TRUE) {
+ MR_rdtsc(&tsc);
+ } else {
+ tsc = 0;
+ }
+ return tsc;
+#else /* not __GNUC__ && (__i386__ || __x86_64__) */
+ return 0;
+#endif /* not __GNUC__ && (__i386__ || __x86_64__) */
+}
+
/*
** It's convenient that this instruction is the same on both i386 and x86_64
*/
Index: runtime/mercury_atomic_ops.h
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_atomic_ops.h,v
retrieving revision 1.7
diff -u -p -b -r1.7 mercury_atomic_ops.h
--- runtime/mercury_atomic_ops.h 6 Nov 2009 05:40:24 -0000 1.7
+++ runtime/mercury_atomic_ops.h 1 Dec 2009 05:29:18 -0000
@@ -50,8 +50,8 @@ MR_compare_and_swap_word(volatile MR_Int
char result; \
\
__asm__ __volatile__( \
- "lock; cmpxchgq %3, %0; setz %1" \
- : "=m"(*addr), "=q"(result) \
+ "lock; cmpxchgq %4, %0; setz %1" \
+ : "=m"(*addr), "=q"(result), "=a"(old) \
: "m"(*addr), "r" (new_val), "a"(old) \
); \
return (int) result; \
@@ -65,8 +65,8 @@ MR_compare_and_swap_word(volatile MR_Int
char result; \
\
__asm__ __volatile__( \
- "lock; cmpxchgl %3, %0; setz %1" \
- : "=m"(*addr), "=q"(result) \
+ "lock; cmpxchgl %4, %0; setz %1" \
+ : "=m"(*addr), "=q"(result), "=a"(old) \
: "m"(*addr), "r" (new_val), "a"(old) \
); \
return (int) result; \
@@ -307,6 +307,113 @@ MR_atomic_sub_int(volatile MR_Integer *a
#endif
+/*
+** Memory fence operations.
+*/
+#if defined(__GNUC__) && ( defined(__i386__) || defined(__x86_64__) )
+ /*
+ ** Guarantees that any stores executed before this fence are globally
+ ** visible before those after this fence.
+ */
+ #define MR_CPU_SFENCE \
+ do { \
+ __asm__ __volatile__("sfence"); \
+ } while(0)
+
+ /*
+ ** Guarantees that any loads executed before this fence are complete before
+ ** any loads after this fence.
+ */
+ #define MR_CPU_LFENCE \
+ do { \
+ __asm__ __volatile__("lfence"); \
+ } while(0)
+
+ /*
+ ** A combination of the above.
+ */
+ #define MR_CPU_MFENCE \
+ do { \
+ __asm__ __volatile__("mfence"); \
+ } while(0)
+
+#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
+
+ /*
+ ** Our memory fences are better than GCC's. GCC only implements a full
+ ** fence.
+ */
+ #define MR_CPU_MFENCE \
+ do { \
+ __sync_synchronize(); \
+ } while(0)
+ #define MR_CPU_SFENCE MR_CPU_MFENCE
+ #define MR_CPU_LFENCE MR_CPU_MFENCE
+
+#else
+
+ #error "Please implement memory fence operations for this " \
+ "compiler/architecture"
+
+#endif
+
+/*
+** Roll our own cheap user-space mutual exclusion locks. Blocking without
+** spinning is not supported. Storage for these locks should be volatile.
+**
+** I expect these to be faster than pthread mutexes when threads are pinned and
+** critical sections are short.
+*/
+typedef MR_Unsigned MR_Us_Lock;
+
+#define MR_US_LOCK_INITIAL_VALUE (0)
+
+#define MR_US_TRY_LOCK(x) \
+ MR_compare_and_swap_word(x, 0, 1)
+
+#define MR_US_SPIN_LOCK(x) \
+ while (!MR_compare_and_swap_word(x, 0, 1)) { \
+ MR_ATOMIC_PAUSE; \
+ }
+
+#define MR_US_UNLOCK(x) \
+ do { \
+ MR_CPU_MFENCE; \
+ *x = 0; \
+ } while(0)
+
+/*
+** Similar support for condition variables. Again, make sure they are
+** volatile.
+**
+** XXX: These are not atomic: a waiting thread will not see a change until
+** some time after the signaling thread has signaled the condition. The same
+** race can occur when clearing a condition. Order of memory operations is not
+** guaranteed either.
+*/
+typedef MR_Unsigned MR_Us_Cond;
+
+#define MR_US_COND_CLEAR(x) \
+ do { \
+ MR_CPU_MFENCE; \
+ *x = 0; \
+ } while(0)
+
+#define MR_US_COND_SET(x) \
+ do { \
+ MR_CPU_MFENCE; \
+ *x = 1; \
+ MR_CPU_MFENCE; \
+ } while(0)
+
+#define MR_US_SPIN_COND(x) \
+ do { \
+ while (!(*x)) { \
+ MR_ATOMIC_PAUSE; \
+ } \
+ MR_CPU_MFENCE; \
+ } while (0)
+
#endif /* MR_LL_PARALLEL_CONJ */
/*
@@ -346,16 +453,21 @@ typedef struct {
/*
** The number of CPU clock cycles per second, ie a 1GHz CPU will have a value
** of 10^9, zero if unknown.
+** This value is only available after MR_do_cpu_feature_detection() has been
+** called.
*/
extern MR_uint_least64_t MR_cpu_cycles_per_sec;
/*
-** Configure the profiling stats code. On i386 and x86_64 machines this uses
-** CPUID to determine if the RDTSCP instruction is available and not prohibited
-** by the OS.
+** Do CPU feature detection; this is necessary for the parallel execution
+** profiling code and the threadscope code.
+** On i386 and x86_64 machines this uses CPUID to determine if the RDTSCP
+** instruction is available and not prohibited by the OS.
+** This function is idempotent.
+** Note: We assume that all processors on a SMP machine are equivalent.
*/
extern void
-MR_configure_profiling_timers(void);
+MR_do_cpu_feature_detection(void);
/*
** Start and initialize a timer structure.
@@ -369,6 +481,13 @@ MR_profiling_start_timer(MR_Timer *timer
extern void
MR_profiling_stop_timer(MR_Timer *timer, MR_Stats *stats);
+/*
+** Read the CPU's TSC. This is currently only implemented for i386 and x86-64
+** systems. It returns 0 when support is not available.
+*/
+extern MR_uint_least64_t
+MR_read_cpu_tsc(void);
+
#endif /* MR_THREAD_SAFE && MR_PROFILE_PARALLEL_EXECUTION_SUPPORT */
/*---------------------------------------------------------------------------*/
Index: runtime/mercury_conf.h.in
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_conf.h.in,v
retrieving revision 1.66
diff -u -p -b -r1.66 mercury_conf.h.in
--- runtime/mercury_conf.h.in 5 Nov 2009 05:47:40 -0000 1.66
+++ runtime/mercury_conf.h.in 30 Nov 2009 10:34:15 -0000
@@ -140,6 +140,7 @@
** MR_HAVE_SYS_MMAN_H we have <sys/mman.h>
** MR_HAVE_SYS_SEM_H we have <sys/sem.h>
** MR_HAVE_SCHED_H we have <sched.h>
+** MR_HAVE_UTMPX_H we have <utmpx.h>
*/
#undef MR_HAVE_SYS_SIGINFO_H
#undef MR_HAVE_SYS_SIGNAL_H
@@ -170,6 +171,7 @@
#undef MR_HAVE_SYS_MMAN_H
#undef MR_HAVE_SYS_SEM_H
#undef MR_HAVE_SCHED_H
+#undef MR_HAVE_UTMPX_H
/*
** MR_HAVE_POSIX_TIMES is defined if we have the POSIX
@@ -268,7 +270,8 @@
** MR_HAVE__PUTENV we have the _putenv() function.
** MR_HAVE_POSIX_SPAWN we have the posix_spawn() function.
** MR_HAVE_FESETROUND we have the fesetround() function.
-** MR_HAVE_SCHED_SETAFFINITY if we have the sched_setaffinity() function.
+** MR_HAVE_SCHED_SETAFFINITY we have the sched_setaffinity() function.
+** MR_HAVE_SCHED_GETCPU we have the sched_getcpu() function (glibc specific).
*/
#undef MR_HAVE_GETPID
#undef MR_HAVE_SETPGID
@@ -331,6 +334,7 @@
#undef MR_HAVE_POSIX_SPAWN
#undef MR_HAVE_FESETROUND
#undef MR_HAVE_SCHED_SETAFFINITY
+#undef MR_HAVE_SCHED_GETCPU
/*
** We use mprotect() and signals to catch stack and heap overflows.
Index: runtime/mercury_context.c
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_context.c,v
retrieving revision 1.71
diff -u -p -b -r1.71 mercury_context.c
--- runtime/mercury_context.c 27 Nov 2009 03:51:19 -0000 1.71
+++ runtime/mercury_context.c 1 Dec 2009 05:31:19 -0000
@@ -38,11 +38,12 @@ ENDINIT
#include "mercury_memory_handlers.h"
#include "mercury_context.h"
#include "mercury_engine.h" /* for `MR_memdebug' */
+#include "mercury_threadscope.h" /* for data types and posting events */
#include "mercury_reg_workarounds.h" /* for `MR_fd*' stuff */
-static void
-MR_init_context_maybe_generator(MR_Context *c, const char *id,
- MR_GeneratorPtr gen);
+#if defined(MR_THREAD_SAFE) && defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+#define MR_PROFILE_PARALLEL_EXECUTION_FILENAME "parallel_execution_profile.txt"
+#endif
/*---------------------------------------------------------------------------*/
@@ -66,9 +67,6 @@ MR_Context *MR_runqueue_tai
#ifdef MR_LL_PARALLEL_CONJ
MR_SparkDeque MR_spark_queue;
MercuryLock MR_sync_term_lock;
- MR_bool MR_thread_pinning = MR_FALSE;
- static MercuryLock MR_next_cpu_lock;
- static MR_Unsigned MR_next_cpu = 0;
#endif
MR_PendingContext *MR_pending_contexts;
@@ -96,14 +94,28 @@ static MR_Integer MR_profile_paral
static MR_Integer MR_profile_parallel_regular_context_reused = 0;
static MR_Integer MR_profile_parallel_small_context_kept = 0;
static MR_Integer MR_profile_parallel_regular_context_kept = 0;
+#endif
/*
-** Write out the profiling data that we collect during execution.
+** Local variables for thread pinning.
*/
-static void
-MR_write_out_profiling_parallel_execution(void);
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_HAVE_SCHED_SETAFFINITY)
+static MercuryLock MR_next_cpu_lock;
+MR_bool MR_thread_pinning = MR_FALSE;
+static MR_Unsigned MR_next_cpu = 0;
+#ifdef MR_HAVE_SCHED_GETCPU
+static MR_Integer MR_primordial_threads_cpu = -1;
+#endif
+#endif
-#define MR_PROFILE_PARALLEL_EXECUTION_FILENAME "parallel_execution_profile.txt"
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+/*
+** These are used to give each context its own unique ID.
+*/
+static MercuryLock MR_next_context_id_lock;
+static MR_ContextId MR_next_context_id;
#endif
/*
@@ -123,11 +135,31 @@ int volatile MR_num_idle_engines = 0;
int volatile MR_num_outstanding_contexts_and_global_sparks = 0;
MR_Integer volatile MR_num_outstanding_contexts_and_all_sparks = 0;
+MR_Unsigned volatile MR_num_exited_engines = 0;
+
static MercuryLock MR_par_cond_stats_lock;
#endif
/*---------------------------------------------------------------------------*/
+static void
+MR_init_context_maybe_generator(MR_Context *c, const char *id,
+ MR_GeneratorPtr gen);
+
+/*
+** Write out the profiling data that we collect during execution.
+*/
+static void
+MR_write_out_profiling_parallel_execution(void);
+
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_HAVE_SCHED_SETAFFINITY)
+static void
+MR_do_pin_thread(int cpu);
+#endif
+
+/*---------------------------------------------------------------------------*/
+
void
MR_init_thread_stuff(void)
{
@@ -199,6 +231,39 @@ MR_init_thread_stuff(void)
#endif /* MR_THREAD_SAFE */
}
+/*
+** Pin the primordial thread to the CPU it is currently using (where
+** support is available).
+*/
+void
+MR_pin_primordial_thread(void)
+{
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_HAVE_SCHED_SETAFFINITY)
+#ifdef MR_HAVE_SCHED_GETCPU
+ /*
+ ** We don't need locking to pin the primordial thread as it is called
+ ** before any other threads exist.
+ */
+ if (MR_thread_pinning) {
+ MR_primordial_threads_cpu = sched_getcpu();
+ if (MR_primordial_threads_cpu == -1) {
+ perror("Warning: unable to determine the current CPU for "
+ "the primordial thread: ");
+ } else {
+ MR_do_pin_thread(MR_primordial_threads_cpu);
+ }
+ }
+ if (MR_primordial_threads_cpu == -1) {
+ MR_pin_thread();
+ }
+#else
+ MR_pin_thread();
+#endif
+#endif /* defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_HAVE_SCHED_SETAFFINITY) */
+}
+
void
MR_pin_thread(void)
{
@@ -206,16 +271,31 @@ MR_pin_thread(void)
defined(MR_HAVE_SCHED_SETAFFINITY)
MR_LOCK(&MR_next_cpu_lock, "MR_pin_thread");
if (MR_thread_pinning) {
+#if defined(MR_HAVE_SCHED_GETCPU)
+ if (MR_next_cpu == MR_primordial_threads_cpu) {
+ MR_next_cpu++;
+ }
+#endif
+ MR_do_pin_thread(MR_next_cpu);
+ MR_next_cpu++;
+ }
+ MR_UNLOCK(&MR_next_cpu_lock, "MR_pin_thread");
+#endif /* defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_HAVE_SCHED_SETAFFINITY) */
+}
+
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_HAVE_SCHED_SETAFFINITY)
+static void
+MR_do_pin_thread(int cpu)
+{
cpu_set_t cpus;
- if (MR_next_cpu < CPU_SETSIZE) {
+ if (cpu < CPU_SETSIZE) {
CPU_ZERO(&cpus);
- CPU_SET(MR_next_cpu, &cpus);
- if (sched_setaffinity(0, sizeof(cpu_set_t), &cpus) == 0)
- {
- MR_next_cpu++;
- } else {
- perror("Warning: Couldn't set CPU affinity");
+ CPU_SET(cpu, &cpus);
+ if (sched_setaffinity(0, sizeof(cpu_set_t), &cpus) == -1) {
+ perror("Warning: Couldn't set CPU affinity: ");
/*
** If this failed once it will probably fail again so we
** disable it.
@@ -224,14 +304,12 @@ MR_pin_thread(void)
}
} else {
perror("Warning: Couldn't set CPU affinity due to a static "
- "system limit");
+ "system limit: ");
MR_thread_pinning = MR_FALSE;
}
- }
- MR_UNLOCK(&MR_next_cpu_lock, "MR_pin_thread");
+}
#endif /* defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
defined(MR_HAVE_SCHED_SETAFFINITY) */
-}
void
MR_finalize_thread_stuff(void)
@@ -394,6 +472,11 @@ MR_init_context_maybe_generator(MR_Conte
c->MR_ctxt_resume_owner_thread = (MercuryThread) NULL;
c->MR_ctxt_resume_c_depth = 0;
c->MR_ctxt_saved_owners = NULL;
+ #if defined(MR_LL_PARALLEL_CONJ) && defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ MR_LOCK(&MR_next_context_id_lock, "MR_create_context");
+ c->MR_ctxt_num_id = MR_next_context_id++;
+ MR_UNLOCK(&MR_next_context_id_lock, "MR_create_context");
+ #endif
#endif
#ifndef MR_HIGHLEVEL_CODE
@@ -660,6 +743,73 @@ MR_destroy_context(MR_Context *c)
}
void
+MR_load_context(MR_Context *load_context_c)
+{
+#ifndef MR_HIGHLEVEL_CODE
+ MR_succip_word = (MR_Word) load_context_c->MR_ctxt_succip;
+ MR_sp_word = (MR_Word) load_context_c->MR_ctxt_sp;
+ MR_maxfr_word = (MR_Word) load_context_c->MR_ctxt_maxfr;
+ MR_curfr_word = (MR_Word) load_context_c->MR_ctxt_curfr;
+ #ifdef MR_USE_MINIMAL_MODEL_STACK_COPY
+ MR_gen_next = load_context_c->MR_ctxt_gen_next;
+ MR_cut_next = load_context_c->MR_ctxt_cut_next;
+ MR_pneg_next = load_context_c->MR_ctxt_pneg_next;
+ #endif
+ #ifdef MR_THREAD_SAFE
+ MR_parent_sp = load_context_c->MR_ctxt_parent_sp;
+ #endif
+#endif
+#ifdef MR_USE_TRAIL
+ #ifdef MR_THREAD_SAFE
+ MR_ENGINE(MR_eng_context).MR_ctxt_trail_zone =
+ load_context_c->MR_ctxt_trail_zone;
+ #else
+ MR_trail_zone = load_context_c->MR_ctxt_trail_zone;
+ #endif
+ MR_trail_ptr = load_context_c->MR_ctxt_trail_ptr;
+ MR_ticket_counter = load_context_c->MR_ctxt_ticket_counter;
+ MR_ticket_high_water = load_context_c->MR_ctxt_ticket_high_water;
+#endif
+#ifndef MR_HIGHLEVEL_CODE
+ MR_ENGINE(MR_eng_context).MR_ctxt_detstack_zone =
+ load_context_c->MR_ctxt_detstack_zone;
+ MR_ENGINE(MR_eng_context).MR_ctxt_prev_detstack_zones =
+ load_context_c->MR_ctxt_prev_detstack_zones;
+ MR_ENGINE(MR_eng_context).MR_ctxt_nondetstack_zone =
+ load_context_c->MR_ctxt_nondetstack_zone;
+ MR_ENGINE(MR_eng_context).MR_ctxt_prev_nondetstack_zones =
+ load_context_c->MR_ctxt_prev_nondetstack_zones;
+ #ifdef MR_USE_MINIMAL_MODEL_STACK_COPY
+ MR_ENGINE(MR_eng_context).MR_ctxt_genstack_zone =
+ load_context_c->MR_ctxt_genstack_zone;
+ MR_ENGINE(MR_eng_context).MR_ctxt_cutstack_zone =
+ load_context_c->MR_ctxt_cutstack_zone;
+ MR_ENGINE(MR_eng_context).MR_ctxt_pnegstack_zone =
+ load_context_c->MR_ctxt_pnegstack_zone;
+ MR_gen_stack = (MR_GenStackFrame *)
+ MR_ENGINE(MR_eng_context).MR_ctxt_genstack_zone->
+ MR_zone_min;
+ MR_cut_stack = (MR_CutStackFrame *)
+ MR_ENGINE(MR_eng_context).MR_ctxt_cutstack_zone->
+ MR_zone_min;
+ MR_pneg_stack = (MR_PNegStackFrame *)
+ MR_ENGINE(MR_eng_context).MR_ctxt_pnegstack_zone->
+ MR_zone_min;
+ #endif
+ #ifdef MR_EXEC_TRACE_INFO_IN_CONTEXT
+ MR_trace_call_seqno = load_context_c->MR_ctxt_call_seqno;
+ MR_trace_call_depth = load_context_c->MR_ctxt_call_depth;
+ MR_trace_event_number = load_context_c->MR_ctxt_event_number;
+ #endif
+#endif /* ! MR_HIGH_LEVEL_CODE */
+ MR_set_min_heap_reclamation_point(load_context_c);
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ MR_threadscope_post_run_context();
+#endif
+}
+
+void
MR_flounder(void)
{
MR_fatal_error("computation floundered");
@@ -771,6 +921,10 @@ MR_check_pending_contexts(MR_bool block)
void
MR_schedule_context(MR_Context *ctxt)
{
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ MR_threadscope_post_context_runnable(ctxt);
+#endif
MR_LOCK(&MR_runqueue_lock, "schedule_context");
ctxt->MR_ctxt_next = NULL;
if (MR_runqueue_tail) {
@@ -862,8 +1016,10 @@ MR_define_entry(MR_do_runnext);
** up the Mercury runtime. It cannot exit by this route.
*/
assert(thd != MR_primordial_thread);
- MR_UNLOCK(&MR_runqueue_lock, "MR_do_runnext (ii)");
MR_destroy_thread(MR_cur_engine());
+ MR_num_exited_engines++;
+ MR_UNLOCK(&MR_runqueue_lock, "MR_do_runnext (ii)");
+ pthread_exit(0);
}
/* Search for a ready context which we can handle. */
@@ -944,13 +1100,18 @@ MR_define_entry(MR_do_runnext);
if (MR_ENGINE(MR_eng_this_context) == NULL) {
MR_ENGINE(MR_eng_this_context) = MR_create_context("from spark",
MR_CONTEXT_SIZE_SMALL, NULL);
- MR_load_context(MR_ENGINE(MR_eng_this_context));
#ifdef MR_PROFILE_PARALLEL_EXECUTION_SUPPORT
+ MR_threadscope_post_create_context_for_spark(MR_ENGINE(MR_eng_this_context));
if (MR_profile_parallel_execution) {
MR_atomic_inc_int(
&MR_profile_parallel_contexts_created_for_sparks);
}
#endif
+ MR_load_context(MR_ENGINE(MR_eng_this_context));
+ } else {
+#if defined(MR_LL_PARALLEL_CONJ) && defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ MR_threadscope_post_run_context();
+#endif
}
MR_parent_sp = spark.MR_spark_parent_sp;
MR_assert(MR_parent_sp != MR_sp);
@@ -992,6 +1153,11 @@ MR_END_MODULE
MR_Code*
MR_do_join_and_continue(MR_SyncTerm *jnc_st, MR_Code *join_label)
{
+ /*
+ * XXX: We should take the current TSC time here and use it to post the
+ * various 'context stopped' threadscope events. This would make the
+ * profile more accurate.
+ */
if (!jnc_st->MR_st_is_shared) {
/* This parallel conjunction has only executed sequentially. */
if (--jnc_st->MR_st_count == 0) {
@@ -1047,6 +1213,9 @@ MR_do_join_and_continue(MR_SyncTerm *jnc
jnc_st->MR_st_orig_context->MR_ctxt_resume = join_label;
MR_schedule_context(jnc_st->MR_st_orig_context);
MR_UNLOCK(&MR_sync_term_lock, "continue ii");
+#ifdef MR_PROFILE_PARALLEL_EXECUTION_SUPPORT
+ MR_threadscope_post_stop_context(MR_TS_STOP_REASON_FINISHED);
+#endif
return MR_ENTRY(MR_do_runnext);
}
} else {
@@ -1096,9 +1265,17 @@ MR_do_join_and_continue(MR_SyncTerm *jnc
** away once we enable work-stealing. - pbone.
*/
if (jnc_ctxt == jnc_st->MR_st_orig_context) {
+#ifdef MR_PROFILE_PARALLEL_EXECUTION_SUPPORT
+ MR_threadscope_post_stop_context(MR_TS_STOP_REASON_BLOCKED);
+#endif
MR_save_context(jnc_ctxt);
MR_ENGINE(MR_eng_this_context) = NULL;
+ } else {
+#ifdef MR_PROFILE_PARALLEL_EXECUTION_SUPPORT
+ MR_threadscope_post_stop_context(MR_TS_STOP_REASON_FINISHED);
+#endif
}
+
/* Finally look for other work. */
MR_UNLOCK(&MR_sync_term_lock, "continue_2 ii");
return MR_ENTRY(MR_do_runnext);
Index: runtime/mercury_context.h
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_context.h,v
retrieving revision 1.56
diff -u -p -b -r1.56 mercury_context.h
--- runtime/mercury_context.h 27 Nov 2009 03:51:19 -0000 1.56
+++ runtime/mercury_context.h 1 Dec 2009 04:19:57 -0000
@@ -256,6 +256,10 @@ struct MR_SparkDeque_Struct {
struct MR_Context_Struct {
const char *MR_ctxt_id;
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ MR_Unsigned MR_ctxt_num_id;
+#endif
MR_ContextSize MR_ctxt_size;
MR_Context *MR_ctxt_next;
MR_Code *MR_ctxt_resume;
@@ -434,6 +438,13 @@ extern MR_PendingContext *MR_pending_
** instructions, even when done from within a critical section.
*/
extern volatile MR_Integer MR_num_outstanding_contexts_and_all_sparks;
+
+ /*
+ ** The number of engines that have exited so far. We can spin on this to
+ ** make sure that our engines have exited before finalizing some global
+ ** resources.
+ */
+ extern volatile MR_Unsigned MR_num_exited_engines;
#endif /* !MR_LL_PARALLEL_CONJ */
/*---------------------------------------------------------------------------*/
@@ -468,7 +479,11 @@ extern void MR_init_thread_stuff
/*
** MR_pin_thread() pins the current thread to the next available processor ID,
** if thread pinning is enabled.
+** MR_pin_primordial_thread() is a special case for the primordial thread. It
+** should only be executed once, and only by the primordial thread _before_
+** the other threads are started.
*/
+extern void MR_pin_primordial_thread(void);
extern void MR_pin_thread(void);
/*
@@ -582,71 +597,8 @@ extern void MR_schedule_context(
#define MR_IF_NOT_HIGHLEVEL_CODE(x)
#endif
-#define MR_load_context(cptr) \
- do { \
- MR_Context *load_context_c; \
- \
- load_context_c = (cptr); \
- MR_IF_NOT_HIGHLEVEL_CODE( \
- MR_succip_word = (MR_Word) load_context_c->MR_ctxt_succip; \
- MR_sp_word = (MR_Word) load_context_c->MR_ctxt_sp; \
- MR_maxfr_word = (MR_Word) load_context_c->MR_ctxt_maxfr; \
- MR_curfr_word = (MR_Word) load_context_c->MR_ctxt_curfr; \
- MR_IF_USE_MINIMAL_MODEL_STACK_COPY( \
- MR_gen_next = load_context_c->MR_ctxt_gen_next; \
- MR_cut_next = load_context_c->MR_ctxt_cut_next; \
- MR_pneg_next = load_context_c->MR_ctxt_pneg_next; \
- ) \
- MR_IF_THREAD_SAFE( \
- MR_parent_sp = load_context_c->MR_ctxt_parent_sp; \
- ) \
- ) \
- MR_IF_USE_TRAIL( \
- MR_IF_NOT_THREAD_SAFE( \
- MR_trail_zone = load_context_c->MR_ctxt_trail_zone; \
- ) \
- MR_IF_THREAD_SAFE( \
- MR_ENGINE(MR_eng_context).MR_ctxt_trail_zone = \
- load_context_c->MR_ctxt_trail_zone; \
- ) \
- MR_trail_ptr = load_context_c->MR_ctxt_trail_ptr; \
- MR_ticket_counter = load_context_c->MR_ctxt_ticket_counter; \
- MR_ticket_high_water = load_context_c->MR_ctxt_ticket_high_water; \
- ) \
- MR_IF_NOT_HIGHLEVEL_CODE( \
- MR_ENGINE(MR_eng_context).MR_ctxt_detstack_zone = \
- load_context_c->MR_ctxt_detstack_zone; \
- MR_ENGINE(MR_eng_context).MR_ctxt_prev_detstack_zones = \
- load_context_c->MR_ctxt_prev_detstack_zones; \
- MR_ENGINE(MR_eng_context).MR_ctxt_nondetstack_zone = \
- load_context_c->MR_ctxt_nondetstack_zone; \
- MR_ENGINE(MR_eng_context).MR_ctxt_prev_nondetstack_zones = \
- load_context_c->MR_ctxt_prev_nondetstack_zones; \
- MR_IF_USE_MINIMAL_MODEL_STACK_COPY( \
- MR_ENGINE(MR_eng_context).MR_ctxt_genstack_zone = \
- load_context_c->MR_ctxt_genstack_zone; \
- MR_ENGINE(MR_eng_context).MR_ctxt_cutstack_zone = \
- load_context_c->MR_ctxt_cutstack_zone; \
- MR_ENGINE(MR_eng_context).MR_ctxt_pnegstack_zone = \
- load_context_c->MR_ctxt_pnegstack_zone; \
- MR_gen_stack = (MR_GenStackFrame *) \
- MR_ENGINE(MR_eng_context).MR_ctxt_genstack_zone-> \
- MR_zone_min; \
- MR_cut_stack = (MR_CutStackFrame *) \
- MR_ENGINE(MR_eng_context).MR_ctxt_cutstack_zone-> \
- MR_zone_min; \
- MR_pneg_stack = (MR_PNegStackFrame *) \
- MR_ENGINE(MR_eng_context).MR_ctxt_pnegstack_zone-> \
- MR_zone_min; \
- ) \
- MR_IF_EXEC_TRACE_INFO_IN_CONTEXT( \
- MR_trace_call_seqno = load_context_c->MR_ctxt_call_seqno; \
- MR_trace_call_depth = load_context_c->MR_ctxt_call_depth; \
- MR_trace_event_number = load_context_c->MR_ctxt_event_number; \
- ) \
- ) \
- MR_set_min_heap_reclamation_point(load_context_c); \
- } while (0)
+extern void
+MR_load_context(MR_Context *c);
#define MR_save_context(cptr) \
do { \
Index: runtime/mercury_engine.c
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_engine.c,v
retrieving revision 1.58
diff -u -p -b -r1.58 mercury_engine.c
--- runtime/mercury_engine.c 4 Apr 2007 01:09:52 -0000 1.58
+++ runtime/mercury_engine.c 30 Nov 2009 10:34:15 -0000
@@ -20,6 +20,8 @@ ENDINIT
#include "mercury_engine.h"
#include "mercury_memory_zones.h" /* for MR_create_zone() */
#include "mercury_memory_handlers.h" /* for MR_default_handler() */
+#include "mercury_threadscope.h" /* for MR_threadscope_setup_engine()
+ and event posting */
#include "mercury_dummy.h"
@@ -147,6 +149,11 @@ MR_init_engine(MercuryEngine *eng)
eng->MR_eng_c_depth = 0;
#endif
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ MR_threadscope_setup_engine(eng);
+#endif
+
/*
** Don't allocate a context for this engine until it is actually needed.
*/
@@ -164,6 +171,13 @@ void MR_finalize_engine(MercuryEngine *e
if (eng->MR_eng_this_context) {
MR_destroy_context(eng->MR_eng_this_context);
}
+
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ if (eng->MR_eng_ts_buffer) {
+ MR_threadscope_finalize_engine(eng);
+ }
+#endif
}
/*---------------------------------------------------------------------------*/
@@ -513,6 +527,9 @@ MR_define_label(engine_done);
MR_GOTO_LABEL(engine_done_2);
}
+#ifdef MR_PROFILE_PARALLEL_EXECUTION_SUPPORT
+ MR_threadscope_post_stop_context(MR_TS_STOP_REASON_YIELDING);
+#endif
MR_save_context(this_ctxt);
this_ctxt->MR_ctxt_resume = MR_LABEL(engine_done_2);
this_ctxt->MR_ctxt_resume_owner_thread = owner->MR_saved_owner_thread;
Index: runtime/mercury_engine.h
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_engine.h,v
retrieving revision 1.50
diff -u -p -b -r1.50 mercury_engine.h
--- runtime/mercury_engine.h 30 Oct 2009 03:33:28 -0000 1.50
+++ runtime/mercury_engine.h 30 Nov 2009 10:34:15 -0000
@@ -392,6 +392,16 @@ typedef struct MR_mercury_engine_struct
#ifdef MR_THREAD_SAFE
MercuryThread MR_eng_owner_thread;
MR_Unsigned MR_eng_c_depth;
+#if defined(MR_LL_PARALLEL_CONJ) && defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ /*
+ ** For each profiling event add this offset to the time so that events on
+ ** different engines that occur at the same time have the same time in
+ ** clock ticks.
+ */
+ MR_int_least64_t MR_eng_cpu_clock_ticks_offset;
+ struct MR_threadscope_event_buffer *MR_eng_ts_buffer;
+ MR_Unsigned MR_eng_id;
+#endif
#endif
jmp_buf *MR_eng_jmp_buf;
MR_Word *MR_eng_exception;
Index: runtime/mercury_thread.c
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_thread.c,v
retrieving revision 1.36
diff -u -p -b -r1.36 mercury_thread.c
--- runtime/mercury_thread.c 27 Nov 2009 03:51:20 -0000 1.36
+++ runtime/mercury_thread.c 30 Nov 2009 10:34:15 -0000
@@ -13,6 +13,7 @@
#include "mercury_memory.h"
#include "mercury_context.h" /* for MR_do_runnext */
#include "mercury_thread.h"
+#include "mercury_threadscope.h"
#include <stdio.h>
#include <errno.h>
@@ -89,6 +90,7 @@ MR_create_thread_2(void *goal0)
if (goal != NULL) {
MR_init_thread(MR_use_now);
(goal->func)(goal->arg);
+ /* XXX: We should clean up the engine here */
} else {
MR_pin_thread();
MR_init_thread(MR_use_later);
@@ -129,6 +131,17 @@ MR_init_thread(MR_when_to_use when_to_us
#ifdef MR_THREAD_SAFE
MR_ENGINE(MR_eng_owner_thread) = pthread_self();
+#if defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ /*
+    ** TSC synchronization is not used; its support is commented out. See
+ ** runtime/mercury_threadscope.h
+ **
+ if (when_to_use == MR_use_later) {
+ MR_threadscope_sync_tsc_slave();
+ }
+ */
+#endif
#endif
switch (when_to_use) {
@@ -137,6 +150,7 @@ MR_init_thread(MR_when_to_use when_to_us
MR_fatal_error("Sorry, not implemented: "
"--high-level-code and multiple engines");
#else
+ /* This call may never return */
(void) MR_call_engine(MR_ENTRY(MR_do_runnext), MR_FALSE);
#endif
MR_destroy_engine(eng);
@@ -152,6 +166,10 @@ MR_init_thread(MR_when_to_use when_to_us
MR_ENGINE(MR_eng_this_context) =
MR_create_context("init_thread",
MR_CONTEXT_SIZE_REGULAR, NULL);
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ MR_threadscope_post_create_context(MR_ENGINE(MR_eng_this_context));
+#endif
}
MR_load_context(MR_ENGINE(MR_eng_this_context));
MR_save_registers();
@@ -189,7 +207,6 @@ MR_destroy_thread(void *eng0)
{
MercuryEngine *eng = eng0;
MR_destroy_engine(eng);
- pthread_exit(0);
}
#endif
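The new mercury_threadscope.c below writes every multi-byte value in big-endian order, one byte at a time, so the eventlog is endian-independent. The core of that encoding can be sketched as follows (a simplified, standalone analogue of the file's put_be_* helpers; the name put_be_u64 is invented):

```c
#include <stdint.h>

/* Write a 64-bit value into buf in big-endian byte order, one byte at
** a time, regardless of the host's endianness. The patch's put_be_*
** helpers do the same thing via nested 32- and 16-bit writers. */
static void
put_be_u64(uint8_t *buf, uint64_t w)
{
    int i;

    for (i = 0; i < 8; i++) {
        buf[i] = (uint8_t) ((w >> (8 * (7 - i))) & 0xFF);
    }
}
```

Because each byte is extracted by shifting and masking, the same bytes land in the file on both little- and big-endian hosts, which is what lets GHC's threadscope tools read logs from any machine.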
Index: runtime/mercury_threadscope.c
===================================================================
RCS file: runtime/mercury_threadscope.c
diff -N runtime/mercury_threadscope.c
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ runtime/mercury_threadscope.c 1 Dec 2009 06:31:17 -0000
@@ -0,0 +1,1203 @@
+/*
+** vim: ts=4 sw=4 expandtab
+*/
+/*
+INIT mercury_sys_init_threadscope
+ENDINIT
+*/
+/*
+** Copyright (C) 2009 The University of Melbourne.
+** Copyright (C) 2008-2009 The GHC Team.
+**
+** This file may only be copied under the terms of the GNU Library General
+** Public License - see the file COPYING.LIB in the Mercury distribution.
+*/
+
+/*
+** Event log format
+**
+** The log format is designed to be extensible: old tools should be
+** able to parse (but not necessarily understand all of) new versions
+** of the format, and new tools will be able to understand old log
+** files.
+**
+** Each event has a specific format. If you add new events, give them
+** new numbers: we never re-use old event numbers.
+**
+** - The format is endian-independent: all values are represented in
+** bigendian order.
+**
+** - The format is extensible:
+**
+** - The header describes each event type and its length. Tools
+** that don't recognise a particular event type can skip those events.
+**
+** - There is room for extra information in the event type
+** specification, which can be ignored by older tools.
+**
+** - Events can have extra information added, but existing fields
+** cannot be changed. Tools should ignore extra fields at the
+** end of the event record.
+**
+** - Old event type ids are never re-used; just take a new identifier.
+**
+**
+** The format
+** ----------
+**
+** log : EVENT_HEADER_BEGIN
+** EventType*
+** EVENT_HEADER_END
+** EVENT_DATA_BEGIN
+** Event*
+** EVENT_DATA_END
+**
+** EventType :
+** EVENT_ET_BEGIN
+** Word16 -- unique identifier for this event
+** Int16 -- >=0 size of the event in bytes (minus the header)
+** -- -1 variable size
+** Word32 -- length of the next field in bytes
+** Word8* -- string describing the event
+** Word32 -- length of the next field in bytes
+** Word8* -- extra info (for future extensions)
+** EVENT_ET_END
+**
+** Event :
+** Word16 -- event_type
+** Word64 -- time (nanosecs)
+** [Word16] -- length of the rest (for variable-sized events only)
+** ... extra event-specific info ...
+**
+** All values are packed; no attempt is made to align them.
+**
+** New events must be registered with GHC. These are kept in the GHC-events
+** package.
+**
+*/
+
+#include "mercury_imp.h"
+
+#include "mercury_threadscope.h"
+
+#include "mercury_atomic_ops.h"
+
+#include <stdio.h>
+#include <string.h>
+
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+
+/***************************************************************************/
+
+/*
+** Markers for begin/end of the Header.
+*/
+#define MR_TS_EVENT_HEADER_BEGIN 0x68647262 /* 'h' 'd' 'r' 'b' */
+#define MR_TS_EVENT_HEADER_END 0x68647265 /* 'h' 'd' 'r' 'e' */
+
+#define MR_TS_EVENT_DATA_BEGIN 0x64617462 /* 'd' 'a' 't' 'b' */
+#define MR_TS_EVENT_DATA_END 0xffff
+
+/*
+** Markers for begin/end of the list of Event Types in the Header.
+** Header, Event Type, Begin = hetb
+** Header, Event Type, End = hete
+*/
+#define MR_TS_EVENT_HET_BEGIN 0x68657462 /* 'h' 'e' 't' 'b' */
+#define MR_TS_EVENT_HET_END 0x68657465 /* 'h' 'e' 't' 'e' */
+
+/*
+** Markers for the beginning and end of individual event types.
+*/
+#define MR_TS_EVENT_ET_BEGIN 0x65746200 /* 'e' 't' 'b' 0 */
+#define MR_TS_EVENT_ET_END 0x65746500 /* 'e' 't' 'e' 0 */
+
+/*
+** The threadscope events:
+*/
+#define MR_TS_EVENT_CREATE_THREAD 0 /* (thread) */
+#define MR_TS_EVENT_RUN_THREAD 1 /* (thread) */
+#define MR_TS_EVENT_STOP_THREAD 2 /* (thread, status) */
+#define MR_TS_EVENT_THREAD_RUNNABLE 3 /* (thread) */
+#define MR_TS_EVENT_MIGRATE_THREAD 4 /* (thread, new_cap) */
+#define MR_TS_EVENT_RUN_SPARK 5 /* (thread) */
+#define MR_TS_EVENT_STEAL_SPARK 6 /* (thread, victim_cap) */
+#define MR_TS_EVENT_SHUTDOWN 7 /* () */
+#define MR_TS_EVENT_THREAD_WAKEUP 8 /* (thread, other_cap) */
+#define MR_TS_EVENT_GC_START 9 /* () */
+#define MR_TS_EVENT_GC_END 10 /* () */
+#define MR_TS_EVENT_REQUEST_SEQ_GC 11 /* () */
+#define MR_TS_EVENT_REQUEST_PAR_GC 12 /* () */
+#define MR_TS_EVENT_CREATE_SPARK_THREAD 15 /* (spark_thread) */
+#define MR_TS_EVENT_LOG_MSG 16 /* (message ...) */
+#define MR_TS_EVENT_STARTUP 17 /* (num_capabilities) */
+#define MR_TS_EVENT_BLOCK_MARKER 18 /* (size, end_time, capability) */
+#define MR_TS_EVENT_USER_MSG 19 /* (message ...) */
+#define MR_TS_EVENT_GC_IDLE 20 /* () */
+#define MR_TS_EVENT_GC_WORK 21 /* () */
+#define MR_TS_EVENT_GC_DONE 22 /* () */
+#define MR_TS_EVENT_CALL_MAIN 23 /* () */
+
+#define MR_TS_NUM_EVENT_TAGS 24
+
+#if 0 /* DEPRECATED EVENTS: */
+#define EVENT_CREATE_SPARK 13 /* (cap, thread) */
+#define EVENT_SPARK_TO_THREAD 14 /* (cap, thread, spark_thread) */
+#endif
+
+/*
+** GHC uses 2MB per buffer. Note that the minimum buffer size is the size of
+** the largest message plus the size of the block marker message; however, it is
+** _sensible_ for the buffer to be much larger so that we make system calls
+** less often.
+*/
+#define MR_TS_BUFFERSIZE (2*1024*1024)
+#define MR_TS_FILENAME_FORMAT ("%s.eventlog")
+#define MR_TSC_SYNC_NUM_ROUNDS (10)
+#define MR_TSC_SYNC_NUM_BEST_ROUNDS (3)
+
+/* Uncomment this to enable some debugging code */
+/* #define MR_DEBUG_THREADSCOPE 1 */
+
+#if MR_DEBUG_THREADSCOPE
+#define MR_DO_THREADSCOPE_DEBUG(x) do { x; } while(0)
+#else
+#define MR_DO_THREADSCOPE_DEBUG(x)
+#endif
+
+/***************************************************************************/
+
+struct MR_threadscope_event_buffer {
+ MR_UnsignedChar MR_tsbuffer_data[MR_TS_BUFFERSIZE];
+
+ /* The current writing position in the buffer. */
+ MR_Unsigned MR_tsbuffer_pos;
+
+ /* The position of the start of the most recent block. */
+ MR_Integer MR_tsbuffer_block_open_pos;
+
+    /* A cheap userspace lock to protect the buffer, even in signal handlers. */
+ volatile MR_Us_Lock MR_tsbuffer_lock;
+};
+
+/*
+** We define some types and functions to write them. These types are set
+** carefully to match the ones that GHC uses.
+*/
+typedef MR_uint_least16_t EventType;
+typedef MR_uint_least64_t Time;
+typedef MR_int_least64_t Timedelta;
+
+/*
+** The difference between two positions in the eventlog file measured in bytes.
+*/
+typedef MR_uint_least32_t EventlogOffset;
+
+typedef struct {
+ EventType etd_event_type;
+ const char *etd_description;
+} EventTypeDesc;
+
+/***************************************************************************/
+
+static EventTypeDesc event_type_descs[] = {
+ {
+ /*
+ ** The startup event informs threadscope of the number of engines we're
+ ** using. It should be given outside of a block.
+ */
+ MR_TS_EVENT_STARTUP,
+ "Startup (num_engines)"
+ },
+ {
+ /*
+ ** The last event in the log. It should be given outside of a block.
+ */
+ MR_TS_EVENT_SHUTDOWN, "Shutdown"
+ },
+ {
+ /*
+        ** A block of events belonging to the named engine follows.
+        ** The length of this block, including the block marker message
+        ** itself, is given, as is the time at which this block finishes.
+ ** Blocks _must not_ exist within other blocks.
+ */
+ MR_TS_EVENT_BLOCK_MARKER,
+ "A block of events generated by a specific engine follows"
+ },
+ {
+ /*
+ ** Called when a context is created or re-used.
+ */
+ MR_TS_EVENT_CREATE_THREAD,
+ "A context is created or re-used"
+ },
+ {
+ /*
+ ** Called from MR_schedule_context()
+ */
+ MR_TS_EVENT_THREAD_RUNNABLE,
+ "The context is being placed on the run queue"
+ },
+ {
+ /*
+        ** The named context began executing on the engine named by the current
+ ** block.
+ */
+ MR_TS_EVENT_RUN_THREAD, "Run context"
+ },
+ {
+ /*
+ ** The named context finished executing on the engine named by the
+ ** current block. The reason why the context stopped is given.
+ */
+ MR_TS_EVENT_STOP_THREAD,
+ "Context stopped"
+ },
+ {
+ /*
+ ** This event is posted when a context is created for a spark.
+ */
+ MR_TS_EVENT_CREATE_SPARK_THREAD,
+ "Create a context for executing a spark"
+ },
+ {
+ /*
+ ** Start a garbage collection run
+ */
+ MR_TS_EVENT_GC_START,
+ "Start GC"
+ },
+ {
+ /*
+ ** Stop a garbage collection run
+ */
+ MR_TS_EVENT_GC_END,
+ "Stop GC",
+ },
+ {
+ /*
+ ** The runtime system is about to call main/2. This message has no
+ ** parameters.
+ */
+ MR_TS_EVENT_CALL_MAIN,
+ "About to call main/2"
+ },
+ {
+ /* Mark the end of this array. */
+ MR_TS_NUM_EVENT_TAGS, NULL
+ }
+};
+
+static MR_uint_least16_t event_type_sizes[] = {
+ [MR_TS_EVENT_STARTUP] = 2, /* MR_EngineId */
+ [MR_TS_EVENT_SHUTDOWN] = 0,
+ [MR_TS_EVENT_BLOCK_MARKER] = 4 + 8 + 2,
+        /* EventlogOffset, Time, MR_EngineId */
+ [MR_TS_EVENT_CREATE_THREAD] = 4, /* MR_ContextId */
+ [MR_TS_EVENT_THREAD_RUNNABLE] = 4, /* MR_ContextId */
+ [MR_TS_EVENT_RUN_THREAD] = 4, /* MR_ContextId */
+ [MR_TS_EVENT_STOP_THREAD] = 4 + 2,
+ /* MR_ContextId, MR_ContextStopReason */
+ [MR_TS_EVENT_CREATE_SPARK_THREAD] = 4, /* MR_ContextId */
+ [MR_TS_EVENT_GC_START] = 0,
+ [MR_TS_EVENT_GC_END] = 0,
+ [MR_TS_EVENT_CALL_MAIN] = 0,
+};
+
+static FILE* MR_threadscope_output_file = NULL;
+static char* MR_threadscope_output_filename;
+
+/*
+** The TSC value recorded when the primordial thread called
+** MR_setup_threadscope(). It is used retroactively to initialise the
+** MR_eng_cpu_clock_ticks_offset field in the engine structure once it is
+** created.
+*/
+static MR_uint_least64_t MR_primordial_first_tsc;
+
+static MercuryLock MR_next_engine_id_lock;
+static MR_EngineId MR_next_engine_id = 0;
+
+static Timedelta MR_global_offset;
+
+static struct MR_threadscope_event_buffer global_buffer;
+
+/***************************************************************************/
+
+/*
+** Is there enough room for this statically sized event in the current engine's
+** buffer, _and_ enough room for the block marker event?
+*/
+static __inline__ MR_bool enough_room_for_event(
+ struct MR_threadscope_event_buffer *buffer,
+ EventType event_type)
+{
+ return (buffer->MR_tsbuffer_pos + event_type_sizes[event_type] +
+ event_type_sizes[MR_TS_EVENT_BLOCK_MARKER] +
+ ((2 + 8) * 2)) /* (EventType, Time) * 2 */
+ < MR_TS_BUFFERSIZE;
+}
+
+/*
+** Is a block currently open?
+*/
+static __inline__ MR_bool block_is_open(
+ struct MR_threadscope_event_buffer *buffer)
+{
+    return buffer->MR_tsbuffer_block_open_pos != -1;
+}
+
+/*
+** Put words into the current engine's buffer in big endian order.
+*/
+static __inline__ void put_byte(
+ struct MR_threadscope_event_buffer *buffer,
+ int byte)
+{
+ buffer->MR_tsbuffer_data[buffer->MR_tsbuffer_pos++] = byte;
+}
+
+static __inline__ void put_be_int16(
+ struct MR_threadscope_event_buffer *buffer,
+ MR_int_least16_t word)
+{
+ put_byte(buffer, (word >> 8) & 0xFF);
+ put_byte(buffer, word & 0xFF);
+}
+
+static __inline__ void put_be_uint16(
+ struct MR_threadscope_event_buffer *buffer,
+ MR_uint_least16_t word)
+{
+ put_byte(buffer, (word >> 8) & 0xFF);
+ put_byte(buffer, word & 0xFF);
+}
+
+static __inline__ void put_be_uint32(
+ struct MR_threadscope_event_buffer *buffer,
+ MR_uint_least32_t word)
+{
+ put_be_uint16(buffer, (word >> 16) & 0xFFFF);
+ put_be_uint16(buffer, word & 0xFFFF);
+}
+
+static __inline__ void put_be_uint64(
+ struct MR_threadscope_event_buffer *buffer,
+ MR_uint_least64_t word)
+{
+ put_be_uint32(buffer, (word >> 32) & 0xFFFFFFFF);
+ put_be_uint32(buffer, word & 0xFFFFFFFF);
+}
+
+static __inline__ void put_string(
+ struct MR_threadscope_event_buffer *buffer,
+ const char *string)
+{
+ unsigned i, len;
+
+ len = strlen(string);
+ put_be_uint32(buffer, len);
+ for (i = 0; i < len; i++) {
+ put_byte(buffer, string[i]);
+ }
+}
+
+static __inline__ void put_timestamp(
+ struct MR_threadscope_event_buffer *buffer,
+ Time timestamp)
+{
+ put_be_uint64(buffer, timestamp);
+}
+
+static __inline__ void put_eventlog_offset(
+ struct MR_threadscope_event_buffer *buffer,
+ EventlogOffset offset)
+{
+ put_be_uint32(buffer, offset);
+}
+
+static __inline__ void put_event_header(
+ struct MR_threadscope_event_buffer *buffer,
+ EventType event_type, Time timestamp)
+{
+ put_be_uint16(buffer, event_type);
+ put_timestamp(buffer, timestamp);
+}
+
+static __inline__ void put_engine_id(
+ struct MR_threadscope_event_buffer *buffer,
+ MR_EngineId engine_num)
+{
+ put_be_uint16(buffer, engine_num);
+}
+
+static __inline__ void put_context_id(
+ struct MR_threadscope_event_buffer *buffer,
+ MR_ContextId context_id)
+{
+ put_be_uint32(buffer, context_id);
+}
+
+static __inline__ void put_stop_reason(
+ struct MR_threadscope_event_buffer *buffer,
+ MR_ContextStopReason reason)
+{
+ put_be_uint16(buffer, reason);
+}
+
+/***************************************************************************/
+
+static struct MR_threadscope_event_buffer*
+MR_create_event_buffer(void);
+
+/*
+** The prelude is everything up to and including the 'DATA_BEGIN' marker.
+*/
+static void
+MR_open_output_file_and_write_prelude(void);
+
+static void
+MR_close_output_file(void);
+
+static void
+put_event_type(struct MR_threadscope_event_buffer *buffer,
+ EventTypeDesc *event_type);
+
+static MR_bool
+flush_event_buffer(struct MR_threadscope_event_buffer *buffer);
+
+static void
+maybe_close_block(struct MR_threadscope_event_buffer *buffer);
+
+static void
+open_block(struct MR_threadscope_event_buffer *buffer);
+
+static void
+start_gc_callback(void);
+static void
+stop_gc_callback(void);
+static void
+pause_thread_gc_callback(void);
+static void
+resume_thread_gc_callback(void);
+
+/***************************************************************************/
+
+static MR_uint_least64_t
+get_current_time_nanosecs(void);
+
+/***************************************************************************/
+
+void
+MR_setup_threadscope(void)
+{
+ MR_DO_THREADSCOPE_DEBUG(
+ fprintf(stderr, "In setup threadscope thread: 0x%lx\n", pthread_self())
+ );
+ /* This value is used later when setting up the primordial engine */
+ MR_primordial_first_tsc = MR_read_cpu_tsc();
+
+ /* Setup locks. */
+ pthread_mutex_init(&MR_next_engine_id_lock, MR_MUTEX_ATTR);
+
+ /*
+ ** These variables are used for TSC synchronization which is not used. See
+ ** below.
+ **
+ pthread_mutex_init(&MR_tsc_sync_slave_lock, MR_MUTEX_ATTR);
+ MR_US_COND_CLEAR(&MR_tsc_sync_slave_entry_cond);
+ MR_US_COND_CLEAR(&MR_tsc_sync_master_entry_cond);
+ MR_US_COND_CLEAR(&MR_tsc_sync_t0);
+ MR_US_COND_CLEAR(&MR_tsc_sync_t1);
+ */
+
+ /* Configure Boehm */
+ GC_mercury_callback_start_collect = start_gc_callback;
+ GC_mercury_callback_stop_collect = stop_gc_callback;
+ GC_mercury_callback_pause_thread = pause_thread_gc_callback;
+ GC_mercury_callback_resume_thread = resume_thread_gc_callback;
+
+ /* Clear the global buffer and setup the file */
+ global_buffer.MR_tsbuffer_pos = 0;
+ global_buffer.MR_tsbuffer_block_open_pos = -1;
+ global_buffer.MR_tsbuffer_lock = MR_US_LOCK_INITIAL_VALUE;
+ MR_open_output_file_and_write_prelude();
+
+ /*
+ ** Put the startup event in the buffer.
+ */
+ put_event_header(&global_buffer, MR_TS_EVENT_STARTUP, 0);
+ put_engine_id(&global_buffer, (MR_EngineId)MR_num_threads);
+ flush_event_buffer(&global_buffer);
+}
+
+void
+MR_finalize_threadscope(void)
+{
+ MR_DO_THREADSCOPE_DEBUG(
+ fprintf(stderr, "In finalize threadscope thread: 0x%lx\n", pthread_self())
+ );
+ flush_event_buffer(&global_buffer);
+ MR_close_output_file();
+}
+
+void
+MR_threadscope_setup_engine(MercuryEngine *eng)
+{
+ MR_DO_THREADSCOPE_DEBUG(
+ fprintf(stderr, "In threadscope setup engine thread: 0x%lx\n", pthread_self())
+ );
+ MR_LOCK(&MR_next_engine_id_lock, "MR_get_next_engine_id");
+ eng->MR_eng_id = MR_next_engine_id++;
+ MR_UNLOCK(&MR_next_engine_id_lock, "MR_get_next_engine_id");
+
+ if (eng->MR_eng_id == 0) {
+ MR_global_offset = -MR_primordial_first_tsc;
+ }
+ eng->MR_eng_cpu_clock_ticks_offset = MR_global_offset;
+
+ eng->MR_eng_ts_buffer = MR_create_event_buffer();
+}
+
+void
+MR_threadscope_finalize_engine(MercuryEngine *eng)
+{
+ struct MR_threadscope_event_buffer *buffer = eng->MR_eng_ts_buffer;
+
+ MR_DO_THREADSCOPE_DEBUG(
+ fprintf(stderr, "In threadscope finalize engine thread: 0x%lx\n", pthread_self())
+ );
+
+ MR_US_SPIN_LOCK(&(buffer->MR_tsbuffer_lock));
+
+ if (!enough_room_for_event(buffer, MR_TS_EVENT_SHUTDOWN)) {
+ flush_event_buffer(buffer);
+ open_block(buffer);
+ } else if (!block_is_open(buffer)) {
+ open_block(buffer);
+ }
+ put_event_header(buffer, MR_TS_EVENT_SHUTDOWN, get_current_time_nanosecs());
+
+ flush_event_buffer(buffer);
+ eng->MR_eng_ts_buffer = NULL;
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+}
+
+#if 0
+/*
+** It looks like we don't need this on modern CPUs including multi-socket
+** systems (goliath). If we find systems where this is needed we can enable it
+** via a runtime check.
+*/
+/*
+** The synchronization of TSCs operates as follows:
+** The master and slave enter their functions. Both threads spin until the
+** other is ready (signaling the other before they begin spinning). Then for
+** MR_TSC_SYNC_NUM_ROUNDS: The master spins waiting for the slave. The slave
+** records its current TSC, signals the master, and spins waiting for a reply.
+** The master, upon hearing from the slave, records its TSC and then signals
+** the slave. The slave can then compute the delay in this round. The slave
+** takes the MR_TSC_SYNC_NUM_BEST_ROUNDS best (smallest) delays and computes
+** the offset as the average of the differences between the two clocks, based
+** on Cristian's algorithm (1989).
+*/
+
+typedef struct {
+ Timedelta delay;
+ Timedelta offset;
+} TimeDelayOffset;
+
+static MercuryLock MR_tsc_sync_slave_lock;
+volatile static MR_Us_Cond MR_tsc_sync_slave_entry_cond;
+volatile static MR_Us_Cond MR_tsc_sync_master_entry_cond;
+volatile static MR_Us_Cond MR_tsc_sync_t0;
+volatile static MR_Us_Cond MR_tsc_sync_t1;
+static Time MR_tsc_sync_master_time;
+
+static int
+compare_time_delay_offset_by_delay(const void *a, const void *b);
+
+void
+MR_threadscope_sync_tsc_master(void)
+{
+ unsigned i;
+
+ /*
+ ** Wait for a slave to enter.
+ */
+ MR_US_COND_SET(&MR_tsc_sync_master_entry_cond);
+ MR_US_SPIN_COND(&MR_tsc_sync_slave_entry_cond);
+ MR_US_COND_CLEAR(&MR_tsc_sync_slave_entry_cond);
+
+ for (i = 0; i < MR_TSC_SYNC_NUM_ROUNDS; i++) {
+ /*
+ ** Wait to receive the message from the slave at T0
+ */
+ MR_US_SPIN_COND(&MR_tsc_sync_t0);
+ MR_US_COND_CLEAR(&MR_tsc_sync_t0);
+
+ /*
+ ** Read our TSC and send the slave a message.
+ */
+ MR_tsc_sync_master_time = MR_read_cpu_tsc();
+ MR_US_COND_SET(&MR_tsc_sync_t1);
+ }
+
+}
+
+void
+MR_threadscope_sync_tsc_slave(void)
+{
+ unsigned i, j;
+ TimeDelayOffset delay_offset[MR_TSC_SYNC_NUM_ROUNDS];
+ Timedelta total_offset;
+ MercuryEngine *eng = MR_thread_engine_base;
+
+ /*
+ ** Only one slave may enter at a time.
+ */
+ MR_LOCK(&MR_tsc_sync_slave_lock, "MR_threadscope_sync_tsc_slave");
+
+ /*
+ ** Tell the master we're ready to begin and wait for it to tell us it's ready.
+ */
+ MR_US_COND_SET(&MR_tsc_sync_slave_entry_cond);
+ MR_US_SPIN_COND(&MR_tsc_sync_master_entry_cond);
+ MR_US_COND_CLEAR(&MR_tsc_sync_master_entry_cond);
+
+ for (i = 0; i < MR_TSC_SYNC_NUM_ROUNDS; i++) {
+ Time slave_tsc_at_t0;
+ Time slave_tsc_at_t2;
+
+ /*
+ ** Get the current time and signal that we've done so (T=0).
+ */
+ slave_tsc_at_t0 = MR_read_cpu_tsc();
+ MR_US_COND_SET(&MR_tsc_sync_t0);
+
+ /*
+        ** Wait for the master to reply; the master handles T=1, and here we
+ ** proceed to T=2.
+ */
+ MR_US_SPIN_COND(&MR_tsc_sync_t1);
+ slave_tsc_at_t2 = MR_read_cpu_tsc();
+ MR_US_COND_CLEAR(&MR_tsc_sync_t1);
+
+ /*
+ ** Here are Cristian's formulas. Delay is the round trip time,
+ ** slave_tsc_at_t0 + delay/2 is the time on the slave's clock that the
+        ** master processed the slave's message and sent its own. This is
+        ** accurate if the communication delays in each direction are
+        ** uniform, that is, if the communication latency is symmetric.
+ */
+ delay_offset[i].delay = slave_tsc_at_t2 - slave_tsc_at_t0;
+ delay_offset[i].offset =
+ MR_tsc_sync_master_time - (slave_tsc_at_t0 + delay_offset[i].delay/2);
+ }
+    /* By now the master thread will have returned to continue its normal work. */
+
+ /*
+ ** We do this debugging output while holding the lock, so that the output
+ ** is reasonable.
+ */
+ MR_DO_THREADSCOPE_DEBUG({
+        fprintf(stderr, "TSC Synchronization for thread 0x%lx\n", pthread_self());
+ for (i = 0; i < MR_TSC_SYNC_NUM_ROUNDS; i++) {
+ fprintf(stderr, "delay: %ld offset (local + global = total) %ld + %ld = %ld\n",
+ delay_offset[i].delay, delay_offset[i].offset, MR_global_offset,
+ delay_offset[i].offset + MR_global_offset);
+ }
+ });
+ MR_UNLOCK(&MR_tsc_sync_slave_lock, "MR_threadscope_sync_tsc_slave");
+
+ /*
+ ** Now to average the best offsets.
+ */
+    qsort(delay_offset, MR_TSC_SYNC_NUM_ROUNDS, sizeof(TimeDelayOffset),
+ compare_time_delay_offset_by_delay);
+    total_offset = 0;
+    for (i = 0; i < MR_TSC_SYNC_NUM_BEST_ROUNDS; i++) {
+        total_offset += delay_offset[i].offset;
+    }
+    eng->MR_eng_cpu_clock_ticks_offset = total_offset / MR_TSC_SYNC_NUM_BEST_ROUNDS + MR_global_offset;
+
+ MR_DO_THREADSCOPE_DEBUG({
+        fprintf(stderr, "TSC Synchronization offset for thread 0x%lx: %ld\n",
+ pthread_self(), eng->MR_eng_cpu_clock_ticks_offset);
+ });
+}
+
+static int
+compare_time_delay_offset_by_delay(const void *a, const void *b) {
+ TimeDelayOffset *tdo_a = (TimeDelayOffset*)a;
+ TimeDelayOffset *tdo_b = (TimeDelayOffset*)b;
+
+ if (tdo_a->delay > tdo_b->delay) {
+ return 1;
+ } else if (tdo_a->delay < tdo_b->delay) {
+ return -1;
+ } else {
+ return 0;
+ }
+}
+
+#endif
+
+/***************************************************************************/
+
+void
+MR_threadscope_post_create_context(MR_Context *context)
+{
+ struct MR_threadscope_event_buffer *buffer = MR_ENGINE(MR_eng_ts_buffer);
+
+ MR_US_SPIN_LOCK(&(buffer->MR_tsbuffer_lock));
+
+ if (!enough_room_for_event(buffer, MR_TS_EVENT_CREATE_THREAD)) {
+ flush_event_buffer(buffer);
+ open_block(buffer);
+ } else if (!block_is_open(buffer)) {
+ open_block(buffer);
+ }
+
+ put_event_header(buffer, MR_TS_EVENT_CREATE_THREAD, get_current_time_nanosecs());
+ put_context_id(buffer, context->MR_ctxt_num_id);
+
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+}
+
+void
+MR_threadscope_post_create_context_for_spark(MR_Context *context)
+{
+ struct MR_threadscope_event_buffer *buffer = MR_ENGINE(MR_eng_ts_buffer);
+
+ MR_US_SPIN_LOCK(&(buffer->MR_tsbuffer_lock));
+
+ if (!enough_room_for_event(buffer, MR_TS_EVENT_CREATE_SPARK_THREAD)) {
+ flush_event_buffer(buffer);
+ open_block(buffer);
+ } else if (!block_is_open(buffer)) {
+ open_block(buffer);
+ }
+
+ put_event_header(buffer, MR_TS_EVENT_CREATE_SPARK_THREAD,
+ get_current_time_nanosecs());
+ put_context_id(buffer, context->MR_ctxt_num_id);
+
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+}
+
+void
+MR_threadscope_post_context_runnable(MR_Context *context)
+{
+ struct MR_threadscope_event_buffer *buffer = MR_ENGINE(MR_eng_ts_buffer);
+
+ MR_US_SPIN_LOCK(&(buffer->MR_tsbuffer_lock));
+
+ if (!enough_room_for_event(buffer, MR_TS_EVENT_THREAD_RUNNABLE)) {
+ flush_event_buffer(buffer);
+ open_block(buffer);
+ } else if (!block_is_open(buffer)) {
+ open_block(buffer);
+ }
+
+ put_event_header(buffer, MR_TS_EVENT_THREAD_RUNNABLE, get_current_time_nanosecs());
+ put_context_id(buffer, context->MR_ctxt_num_id);
+
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+}
+
+static void
+MR_threadscope_post_run_context_locked(
+ struct MR_threadscope_event_buffer *buffer,
+ MR_Context *context)
+{
+ if (!enough_room_for_event(buffer, MR_TS_EVENT_RUN_THREAD)) {
+ flush_event_buffer(buffer);
+ open_block(buffer);
+ } else if (!block_is_open(buffer)) {
+ open_block(buffer);
+ }
+
+ put_event_header(buffer, MR_TS_EVENT_RUN_THREAD,
+ get_current_time_nanosecs());
+    put_context_id(buffer, context->MR_ctxt_num_id);
+}
+
+void
+MR_threadscope_post_run_context(void)
+{
+ struct MR_threadscope_event_buffer *buffer;
+ MR_Context *context;
+
+ buffer = MR_thread_engine_base->MR_eng_ts_buffer;
+
+ context = MR_thread_engine_base->MR_eng_this_context;
+
+ if (context) {
+ MR_US_SPIN_LOCK(&(buffer->MR_tsbuffer_lock));
+ MR_threadscope_post_run_context_locked(buffer, context);
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+ }
+}
+
+static void
+MR_threadscope_post_stop_context_locked(
+ struct MR_threadscope_event_buffer *buffer,
+ MR_Context *context,
+ MR_ContextStopReason reason)
+{
+ if (!enough_room_for_event(buffer, MR_TS_EVENT_STOP_THREAD)) {
+ flush_event_buffer(buffer);
+ open_block(buffer);
+ } else if (!block_is_open(buffer)) {
+ open_block(buffer);
+ }
+
+ put_event_header(buffer, MR_TS_EVENT_STOP_THREAD, get_current_time_nanosecs());
+ put_context_id(buffer, context->MR_ctxt_num_id);
+ put_stop_reason(buffer, reason);
+}
+
+void
+MR_threadscope_post_stop_context(MR_ContextStopReason reason)
+{
+ struct MR_threadscope_event_buffer *buffer;
+ MR_Context *context;
+
+ buffer = MR_thread_engine_base->MR_eng_ts_buffer;
+ context = MR_thread_engine_base->MR_eng_this_context;
+
+ MR_US_SPIN_LOCK(&(buffer->MR_tsbuffer_lock));
+ MR_threadscope_post_stop_context_locked(buffer, context, reason);
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+}
+
+void
+MR_threadscope_post_calling_main(void) {
+ struct MR_threadscope_event_buffer *buffer = MR_ENGINE(MR_eng_ts_buffer);
+
+ MR_US_SPIN_LOCK(&(buffer->MR_tsbuffer_lock));
+ if (!enough_room_for_event(buffer, MR_TS_EVENT_CALL_MAIN)) {
+ flush_event_buffer(buffer);
+ open_block(buffer);
+ } else if (!block_is_open(buffer)) {
+ open_block(buffer);
+ }
+
+ put_event_header(buffer, MR_TS_EVENT_CALL_MAIN, get_current_time_nanosecs());
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+}
+
+/***************************************************************************/
+
+static struct MR_threadscope_event_buffer*
+MR_create_event_buffer(void)
+{
+ struct MR_threadscope_event_buffer* buffer;
+
+ buffer = MR_GC_NEW(MR_threadscope_event_buffer_t);
+ buffer->MR_tsbuffer_pos = 0;
+ buffer->MR_tsbuffer_block_open_pos = -1;
+ buffer->MR_tsbuffer_lock = MR_US_LOCK_INITIAL_VALUE;
+
+ return buffer;
+}
+
+/***************************************************************************/
+
+static void
+MR_open_output_file_and_write_prelude(void)
+{
+ MR_Unsigned filename_len;
+ char *progname_copy;
+ char *progname_base;
+ MR_Unsigned i;
+
+ progname_copy = strdup(MR_progname);
+ progname_base = basename(progname_copy);
+
+ /*
+ ** This is an over-approximation of the amount of space needed for this
+ ** filename.
+ */
+ filename_len = strlen(progname_base) + strlen(MR_TS_FILENAME_FORMAT) + 1;
+ MR_threadscope_output_filename = MR_GC_NEW_ARRAY(char, filename_len);
+ snprintf(MR_threadscope_output_filename, filename_len,
+ MR_TS_FILENAME_FORMAT, progname_base);
+ free(progname_copy);
+ progname_copy = NULL;
+ progname_base = NULL;
+
+ MR_threadscope_output_file = fopen(MR_threadscope_output_filename, "w");
+ if (!MR_threadscope_output_file) {
+ perror(MR_threadscope_output_filename);
+ return;
+ }
+
+ put_be_uint32(&global_buffer, MR_TS_EVENT_HEADER_BEGIN);
+ put_be_uint32(&global_buffer, MR_TS_EVENT_HET_BEGIN);
+ for ( i = 0;
+ event_type_descs[i].etd_event_type != MR_TS_NUM_EVENT_TAGS;
+ i++) {
+ put_event_type(&global_buffer, &event_type_descs[i]);
+ }
+ put_be_uint32(&global_buffer, MR_TS_EVENT_HET_END);
+ put_be_uint32(&global_buffer, MR_TS_EVENT_HEADER_END);
+ put_be_uint32(&global_buffer, MR_TS_EVENT_DATA_BEGIN);
+
+ flush_event_buffer(&global_buffer);
+}
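(Reviewer aside: the filename sizing above can be sketched in isolation. A minimal, self-contained sketch, assuming a format string like "%s.eventlog" — the real MR_TS_FILENAME_FORMAT is defined elsewhere and may differ. The buffer length is an over-approximation because strlen of the format string also counts the "%s" that gets substituted.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical format string; the real MR_TS_FILENAME_FORMAT is defined
** in mercury_threadscope.c and may differ. */
#define TS_FILENAME_FORMAT "%s.eventlog"

/* Build the output filename the same way as
** MR_open_output_file_and_write_prelude(): size the buffer with an
** over-approximation, then let snprintf write at most that many bytes. */
static char *
make_output_filename(const char *progname_base)
{
    size_t len = strlen(progname_base) + strlen(TS_FILENAME_FORMAT) + 1;
    char *name = malloc(len);

    if (name != NULL) {
        snprintf(name, len, TS_FILENAME_FORMAT, progname_base);
    }
    return name;
}
```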
+
+static void
+MR_close_output_file(void)
+{
+ if (MR_threadscope_output_file) {
+ put_be_uint16(&global_buffer, MR_TS_EVENT_DATA_END);
+ if (flush_event_buffer(&global_buffer)) {
+ if (EOF == fclose(MR_threadscope_output_file)) {
+ perror(MR_threadscope_output_filename);
+ }
+ MR_threadscope_output_file = NULL;
+ MR_threadscope_output_filename = NULL;
+ }
+ }
+}
+
+static void
+put_event_type(struct MR_threadscope_event_buffer *buffer, EventTypeDesc *event_type)
+{
+ put_be_uint32(buffer, MR_TS_EVENT_ET_BEGIN);
+
+ put_be_uint16(buffer, event_type->etd_event_type);
+ put_be_int16(buffer, event_type_sizes[event_type->etd_event_type]);
+
+ put_string(buffer, event_type->etd_description);
+
+ /* There is no extended data in any of our events */
+ put_be_uint32(buffer, 0);
+
+ put_be_uint32(buffer, MR_TS_EVENT_ET_END);
+}
+
+static MR_bool
+flush_event_buffer(struct MR_threadscope_event_buffer *buffer)
+{
+ maybe_close_block(buffer);
+
+ /*
+ ** fwrite handles locking for us, so we have no concurrent access problems.
+ */
+ if (MR_threadscope_output_file && buffer->MR_tsbuffer_pos) {
+ if (0 == fwrite(buffer->MR_tsbuffer_data, buffer->MR_tsbuffer_pos, 1,
+ MR_threadscope_output_file)) {
+ perror(MR_threadscope_output_filename);
+ MR_threadscope_output_file = NULL;
+ MR_threadscope_output_filename = NULL;
+ }
+ }
+ buffer->MR_tsbuffer_pos = 0;
+
+ return (MR_threadscope_output_filename ? MR_TRUE : MR_FALSE);
+}
+
+static void
+maybe_close_block(struct MR_threadscope_event_buffer *buffer)
+{
+ MR_Unsigned saved_pos;
+
+ if (buffer->MR_tsbuffer_block_open_pos != -1)
+ {
+ saved_pos = buffer->MR_tsbuffer_pos;
+ buffer->MR_tsbuffer_pos = buffer->MR_tsbuffer_block_open_pos +
+ sizeof(EventType) + sizeof(Time);
+ put_eventlog_offset(buffer, saved_pos - buffer->MR_tsbuffer_block_open_pos);
+ put_timestamp(buffer, get_current_time_nanosecs());
+
+ buffer->MR_tsbuffer_block_open_pos = -1;
+ buffer->MR_tsbuffer_pos = saved_pos;
+ }
+}
+
+static void
+open_block(struct MR_threadscope_event_buffer *buffer)
+{
+ maybe_close_block(buffer);
+
+ /*
+ ** Save the current position; maybe_close_block() uses it to find the
+ ** block marker that it must write into.
+ */
+ buffer->MR_tsbuffer_block_open_pos = buffer->MR_tsbuffer_pos;
+
+ put_event_header(buffer, MR_TS_EVENT_BLOCK_MARKER, get_current_time_nanosecs());
+
+ /* Skip over the next two fields; they are filled in by maybe_close_block */
+ buffer->MR_tsbuffer_pos += sizeof(EventlogOffset) + sizeof(Time);
+
+ put_engine_id(buffer, MR_ENGINE(MR_eng_id));
+}
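(Reviewer aside: open_block()/maybe_close_block() above use a reserve-then-backpatch scheme — the block marker is written with a gap for the block's eventual size, and the gap is filled in once the block is closed. A minimal self-contained sketch of the idea; all names here are invented for illustration and the real code also backpatches an end timestamp.)

```c
#include <stddef.h>
#include <stdint.h>

static unsigned char buf[64];
static size_t pos = 0;
static size_t block_open_pos = (size_t) -1;

/* Write a 32-bit value in big-endian order, as the eventlog format
** requires. */
static void
put_be_uint32_demo(uint32_t x)
{
    buf[pos++] = (x >> 24) & 0xff;
    buf[pos++] = (x >> 16) & 0xff;
    buf[pos++] = (x >> 8) & 0xff;
    buf[pos++] = x & 0xff;
}

/* Open a block: remember where it starts and leave a 4-byte gap for its
** eventual size. */
static void
open_demo_block(void)
{
    block_open_pos = pos;
    pos += 4;                       /* filled in by close_demo_block() */
}

/* Close the block: rewind to the gap, write the now-known size, and
** restore the write position. */
static uint32_t
close_demo_block(void)
{
    size_t saved_pos = pos;
    uint32_t size = (uint32_t) (saved_pos - block_open_pos);

    pos = block_open_pos;
    put_be_uint32_demo(size);
    pos = saved_pos;
    block_open_pos = (size_t) -1;
    return size;
}
```

For example, opening a block, appending two payload bytes, and closing it yields a block size of 6 (4 header bytes plus 2 payload bytes) backpatched into the gap.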
+
+static void
+start_gc_callback(void)
+{
+ struct MR_threadscope_event_buffer *buffer;
+ MR_Context *context;
+
+ MR_DO_THREADSCOPE_DEBUG(
+ fprintf(stderr, "In gc start callback thread: 0x%lx\n", pthread_self())
+ );
+ if (MR_thread_engine_base == NULL) return;
+ buffer = MR_thread_engine_base->MR_eng_ts_buffer;
+ if (buffer == NULL) {
+ /* GC might be running before we're done setting up */
+ return;
+ }
+
+ if (MR_US_TRY_LOCK(&(buffer->MR_tsbuffer_lock))) {
+ context = MR_thread_engine_base->MR_eng_this_context;
+ if (context) {
+ MR_threadscope_post_stop_context_locked(buffer,
+ context, MR_TS_STOP_REASON_HEAP_OVERFLOW);
+ }
+
+ if (!enough_room_for_event(buffer, MR_TS_EVENT_GC_START)) {
+ flush_event_buffer(buffer);
+ open_block(buffer);
+ } else if (!block_is_open(buffer)) {
+ open_block(buffer);
+ }
+
+ put_event_header(buffer, MR_TS_EVENT_GC_START,
+ get_current_time_nanosecs());
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+ }
+}
+
+static void
+stop_gc_callback(void)
+{
+ struct MR_threadscope_event_buffer *buffer;
+ MR_Context *context;
+
+ MR_DO_THREADSCOPE_DEBUG(
+ fprintf(stderr, "In gc stop callback thread: 0x%lx\n", pthread_self())
+ );
+ if (MR_thread_engine_base == NULL) return;
+ buffer = MR_thread_engine_base->MR_eng_ts_buffer;
+ if (buffer == NULL) {
+ /* GC might be running before we're done setting up */
+ return;
+ }
+
+ if (MR_US_TRY_LOCK(&(buffer->MR_tsbuffer_lock))) {
+ if (!enough_room_for_event(buffer, MR_TS_EVENT_GC_END)) {
+ flush_event_buffer(buffer);
+ open_block(buffer);
+ } else if (!block_is_open(buffer)) {
+ open_block(buffer);
+ }
+
+ put_event_header(buffer, MR_TS_EVENT_GC_END, get_current_time_nanosecs());
+
+ context = MR_thread_engine_base->MR_eng_this_context;
+ if (context) {
+ MR_threadscope_post_run_context_locked(buffer, context);
+ }
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+ }
+}
+
+static void
+pause_thread_gc_callback(void)
+{
+ struct MR_threadscope_event_buffer *buffer;
+ MR_Context *context;
+
+ MR_DO_THREADSCOPE_DEBUG(
+ fprintf(stderr, "In gc pause thread callback thread: 0x%lx\n", pthread_self())
+ );
+ if (MR_thread_engine_base == NULL) return;
+ buffer = MR_thread_engine_base->MR_eng_ts_buffer;
+ if (buffer == NULL) {
+ /* GC might be running before we're done setting up */
+ return;
+ }
+
+ context = MR_thread_engine_base->MR_eng_this_context;
+ if (context && MR_US_TRY_LOCK(&(buffer->MR_tsbuffer_lock))) {
+ MR_threadscope_post_stop_context_locked(buffer, context, MR_TS_STOP_REASON_YIELDING);
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+ }
+}
+
+static void
+resume_thread_gc_callback(void)
+{
+ struct MR_threadscope_event_buffer *buffer;
+ MR_Context *context;
+
+ MR_DO_THREADSCOPE_DEBUG(
+ fprintf(stderr, "In gc resume thread callback thread: 0x%lx\n", pthread_self())
+ );
+ if (MR_thread_engine_base == NULL) return;
+ buffer = MR_thread_engine_base->MR_eng_ts_buffer;
+ if (buffer == NULL) {
+ /* GC might be running before we're done setting up */
+ return;
+ }
+
+ context = MR_thread_engine_base->MR_eng_this_context;
+ if (context && MR_US_TRY_LOCK(&(buffer->MR_tsbuffer_lock))) {
+ MR_threadscope_post_run_context_locked(buffer, context);
+ MR_US_UNLOCK(&(buffer->MR_tsbuffer_lock));
+ }
+}
+
+/***************************************************************************/
+
+static MR_uint_least64_t
+get_current_time_nanosecs(void)
+{
+ MR_uint_least64_t current_tsc;
+ MercuryEngine *eng = MR_thread_engine_base;
+
+ current_tsc = MR_read_cpu_tsc();
+ return (current_tsc + eng->MR_eng_cpu_clock_ticks_offset) /
+ (MR_cpu_cycles_per_sec / 1000000000);
+}
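(Reviewer aside: the conversion in get_current_time_nanosecs() divides the offset-adjusted tick count by the number of CPU cycles per nanosecond. A sketch with made-up numbers — the 3 GHz clock rate below is an assumption for illustration. Note the integer division assumes a clock of at least 1 GHz, otherwise the divisor would be zero.)

```c
#include <stdint.h>

/* Convert an offset-adjusted TSC reading to nanoseconds, mirroring
** get_current_time_nanosecs(). cycles_per_sec plays the role of
** MR_cpu_cycles_per_sec. The integer division assumes the clock runs at
** 1 GHz or faster, otherwise cycles_per_nsec would be zero. */
static uint64_t
ticks_to_nanosecs(uint64_t tsc, uint64_t offset, uint64_t cycles_per_sec)
{
    uint64_t cycles_per_nsec = cycles_per_sec / 1000000000u;

    return (tsc + offset) / cycles_per_nsec;
}
```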
+
+/***************************************************************************/
+
+#endif /* defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT) */
+
+/* forward decls to suppress gcc warnings */
+void mercury_sys_init_threadscope_init(void);
+void mercury_sys_init_threadscope_init_type_tables(void);
+#ifdef MR_DEEP_PROFILING
+void mercury_sys_init_threadscope_write_out_proc_statics(FILE *fp);
+#endif
+
+void mercury_sys_init_threadscope_init(void)
+{
+#ifndef MR_HIGHLEVEL_CODE
+/* XXX: What does this do? Why do other modules have a call like this?
+ threadscope_module();
+*/
+#endif
+}
+
+void mercury_sys_init_threadscope_init_type_tables(void)
+{
+ /* no types to register */
+}
+
+#ifdef MR_DEEP_PROFILING
+void mercury_sys_init_threadscope_write_out_proc_statics(FILE *fp)
+{
+ /* no proc_statics to write out */
+}
+#endif
Index: runtime/mercury_threadscope.h
===================================================================
RCS file: runtime/mercury_threadscope.h
diff -N runtime/mercury_threadscope.h
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ runtime/mercury_threadscope.h 1 Dec 2009 06:17:07 -0000
@@ -0,0 +1,134 @@
+/*
+** vim:ts=4 sw=4 expandtab
+*/
+/*
+** Copyright (C) 2009 The University of Melbourne.
+**
+** This file may only be copied under the terms of the GNU Library General
+** Public License - see the file COPYING.LIB in the Mercury distribution.
+*/
+
+/*
+** mercury_threadscope.h - defines Mercury threadscope profiling support.
+**
+** See "Parallel Performance Tuning for Haskell" - Don Jones Jr, Simon Marlow
+** and Satnam Singh for information about threadscope.
+*/
+
+#ifndef MERCURY_THREADSCOPE_H
+#define MERCURY_THREADSCOPE_H
+
+#include "mercury_types.h" /* for MR_Word, MR_Code, etc */
+#include "mercury_engine.h"
+#include "mercury_context.h"
+
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+
+/*
+** Reasons why a context has been stopped. Not all of these apply to Mercury;
+** for instance, contexts don't yield.
+*/
+#define MR_TS_STOP_REASON_HEAP_OVERFLOW 1
+#define MR_TS_STOP_REASON_STACK_OVERFLOW 2
+#define MR_TS_STOP_REASON_YIELDING 3
+#define MR_TS_STOP_REASON_BLOCKED 4
+#define MR_TS_STOP_REASON_FINISHED 5
+
+typedef struct MR_threadscope_event_buffer MR_threadscope_event_buffer_t;
+
+typedef MR_uint_least16_t MR_EngineId;
+typedef MR_uint_least16_t MR_ContextStopReason;
+typedef MR_uint_least32_t MR_ContextId;
+
+/*
+** This must be called by the primordial thread before starting any other
+** threads but after the primordial thread has been pinned.
+*/
+extern void
+MR_setup_threadscope(void);
+
+extern void
+MR_finalize_threadscope(void);
+
+extern void
+MR_threadscope_setup_engine(MercuryEngine *eng);
+
+extern void
+MR_threadscope_finalize_engine(MercuryEngine *eng);
+
+#if 0
+/*
+** It looks like we don't need TSC synchronization code on modern x86(-64) CPUs
+** including multi-socket systems (tested on goliath and taura). If we find
+** systems where this is needed, we can enable it via a runtime check.
+*/
+/*
+** Synchronize a slave thread's TSC offset to the master's. The master thread
+** (with an engine) should call MR_threadscope_sync_tsc_master() for each slave
+** while each slave (with an engine) calls MR_threadscope_sync_tsc_slave().
+** All master-slave pairs must already be pinned to CPUs and have set up
+** their threadscope structures (by calling MR_threadscope_setup_engine()
+** above). Multiple slaves may call the slave function at the same time; a
+** lock ensures that only one is synchronized at a time. Only the primordial
+** thread may call MR_threadscope_sync_tsc_master().
+*/
+extern void
+MR_threadscope_sync_tsc_master(void);
+extern void
+MR_threadscope_sync_tsc_slave(void);
+#endif
+
+/*
+** Use the following functions to post messages. All messages read the
+** current engine's ID from the engine word; some messages also read the
+** current context's ID from the context loaded into the current engine.
+*/
+
+/*
+** This context has been created. The context must be passed as a parameter
+** so that it doesn't have to be the current context.
+**
+** Using the MR_Context typedef here would require the inclusion of
+** mercury_context.h, creating a circular dependency.
+*/
+extern void
+MR_threadscope_post_create_context(struct MR_Context_Struct *context);
+
+/*
+** The given context was created in order to execute a spark. It's an
+** alternative to the above event.
+*/
+extern void
+MR_threadscope_post_create_context_for_spark(struct MR_Context_Struct *ctxt);
+
+/*
+** This message says the context is now ready to run, for example because it
+** is being placed on the run queue after having been blocked.
+*/
+extern void
+MR_threadscope_post_context_runnable(struct MR_Context_Struct *context);
+
+/*
+** This message says we're now running the current context.
+*/
+extern void
+MR_threadscope_post_run_context(void);
+
+/*
+** This message says we've stopped executing the current context;
+** a reason must be provided.
+*/
+extern void
+MR_threadscope_post_stop_context(MR_ContextStopReason reason);
+
+/*
+** Post this message just before invoking the main/2 predicate.
+*/
+extern void
+MR_threadscope_post_calling_main(void);
+
+#endif /* defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT) */
+
+#endif /* not MERCURY_THREADSCOPE_H */
Index: runtime/mercury_wrapper.c
===================================================================
RCS file: /home/mercury1/repository/mercury/runtime/mercury_wrapper.c,v
retrieving revision 1.202
diff -u -p -b -r1.202 mercury_wrapper.c
--- runtime/mercury_wrapper.c 30 Nov 2009 23:24:40 -0000 1.202
+++ runtime/mercury_wrapper.c 1 Dec 2009 05:43:41 -0000
@@ -65,6 +65,7 @@ ENDINIT
#include "mercury_memory.h" /* for MR_copy_string() */
#include "mercury_memory_handlers.h" /* for MR_default_handler */
#include "mercury_thread.h" /* for MR_debug_threads */
+#include "mercury_threadscope.h"
#if defined(MR_HAVE__SNPRINTF) && ! defined(MR_HAVE_SNPRINTF)
#define snprintf _snprintf
@@ -529,12 +530,13 @@ mercury_runtime_init(int argc, char **ar
#if defined(MR_THREAD_SAFE) && defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
/*
- ** Setup support for reading the CPU's TSC. This is currently used by
- ** profiling of the parallelism runtime but may be used by other profiling
+ ** Setup support for reading the CPU's TSC and detect the clock speed of the
+ ** processor. This is currently used by profiling of the parallelism
+ ** runtime and the threadscope support but may be used by other profiling
** or timing code. On architectures other than i386 and amd64 this is a
** no-op.
*/
- MR_configure_profiling_timers();
+ MR_do_cpu_feature_detection();
#endif
/*
@@ -614,6 +616,18 @@ mercury_runtime_init(int argc, char **ar
MR_ticket_high_water = 1;
#endif
#else
+
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ)
+ MR_pin_primordial_thread();
+#if defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ /*
+ ** We must set up threadscope before we set up the first engine.
+ ** The primordial thread must already be pinned, if pinning is configured.
+ */
+ MR_setup_threadscope();
+#endif
+#endif
+
/*
** Start up the Mercury engine. We don't yet know how many slots will be
** needed for thread-local mutable values so allocate the maximum number.
@@ -627,10 +641,20 @@ mercury_runtime_init(int argc, char **ar
int i;
MR_exit_now = MR_FALSE;
- for (i = 1 ; i < MR_num_threads ; i++) {
+
+ for (i = 1; i < MR_num_threads; i++) {
MR_create_thread(NULL);
}
- MR_pin_thread();
+ #ifdef MR_PROFILE_PARALLEL_EXECUTION_SUPPORT
+ /*
+ ** TSC synchronization is not used; the support code is commented out.
+ ** See runtime/mercury_threadscope.h.
+ **
+ for (i = 1; i < MR_num_threads; i++) {
+ MR_threadscope_sync_tsc_master();
+ }
+ */
+ #endif
while (MR_num_idle_engines < MR_num_threads-1) {
/* busy wait until the worker threads are ready */
MR_ATOMIC_PAUSE;
@@ -2413,6 +2437,13 @@ mercury_runtime_main(void)
MR_setup_callback(MR_program_entry_point);
#endif
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+
+ MR_threadscope_post_calling_main();
+
+#endif
+
#ifdef MR_HIGHLEVEL_CODE
MR_do_interpreter();
#else
@@ -2421,6 +2452,13 @@ mercury_runtime_main(void)
MR_debugmsg0("Returning from MR_call_engine()\n");
#endif
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+
+ MR_threadscope_post_stop_context(MR_TS_STOP_REASON_FINISHED);
+
+#endif
+
#ifdef MR_DEEP_PROFILING
MR_current_call_site_dynamic = saved_cur_csd;
MR_current_callback_site = saved_cur_callback;
@@ -2929,6 +2967,18 @@ mercury_runtime_terminate(void)
pthread_cond_broadcast(&MR_runqueue_cond);
MR_UNLOCK(&MR_runqueue_lock, "exit_now");
+ while (MR_num_exited_engines < MR_num_threads - 1) {
+ MR_ATOMIC_PAUSE;
+ }
+
+#if defined(MR_THREAD_SAFE) && defined(MR_LL_PARALLEL_CONJ) && \
+ defined(MR_PROFILE_PARALLEL_EXECUTION_SUPPORT)
+ if (MR_ENGINE(MR_eng_ts_buffer))
+ MR_threadscope_finalize_engine(MR_thread_engine_base);
+
+ MR_finalize_threadscope();
+#endif
+
assert(MR_primordial_thread == pthread_self());
MR_primordial_thread = (MercuryThread) 0;