[m-users.] uncaught Mercury exception using mdprof_create_feedback
paul at bone.id.au
Fri Oct 10 09:47:18 AEDT 2014
On Thu, Oct 09, 2014 at 08:14:41PM +0200, Matthias Guedemann wrote:
> Hi Paul,
> > parallelism, which is too little to measure. I'll let you try it with
> > my feedback file and see what you find. Also you can see what the
> > autoparallelisation recommends by using the mdprof_report_feedback
> > tool to translate the feedback file into a human readable form.
> > I also tried the --no-ipar-dep-conjs flag, in that case the automatic
> > analysis chooses the same parallelisation as you did, and predicts a
> > speedup of 2.5.
> It seems that both versions result in approximately the same speedup,
> with a slight advantage for the implicit parallelism version.
> On a dual core i5 with hyperthreading I get the following results using
> the asm_fast.par.gc.stseg grade.
> Two runs for the version of the feedback file:
> ./euler_82 167.20s user 0.10s system 275% cpu 1:00.68 total
> ./euler_82 177.42s user 0.09s system 294% cpu 1:00.18 total
> and this for the manual version:
> ./euler_82 180.83s user 0.12s system 289% cpu 1:02.44 total
> ./euler_82 173.65s user 0.07s system 282% cpu 1:01.43 total
> The non-parallel version in asm_fast grade gives:
> ./euler_82 91.04s user 0.06s system 99% cpu 1:31.37 total
> ./euler_82 91.55s user 0.05s system 100% cpu 1:31.50 total
> Is this consistent with your experiences / expectations?
It's hard to say because I'm not familiar with the program or its data.
It might also be better on a quad core system.
One thing that I notice (this is rhetorical, I'll use it to demonstrate the
answer) is that the sequential version takes about 92 seconds and the (manual)
parallel one takes about 62 seconds. This is a speedup of approximately
1.5. On a dual core machine the goal is just under 2.0, depending on the
program. So it looks like this program hasn't parallelised well.
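
As a rough check of the figures above, the speedup can be computed from the
wall-clock ("total") times quoted in Matthias's message. This is only a
sketch: the function name speedup is illustrative, and each version is
averaged over the two runs that were reported.

```python
from statistics import mean

# Wall-clock ("total") times in seconds, taken from the runs quoted above.
sequential = [91.37, 91.50]   # non-parallel version
manual     = [62.44, 61.43]   # manual parallelisation
automatic  = [60.68, 60.18]   # parallelisation from the feedback file

def speedup(baseline, parallel):
    """Ratio of the mean baseline time to the mean parallel time."""
    return mean(baseline) / mean(parallel)

print(f"manual:    {speedup(sequential, manual):.2f}")
print(f"automatic: {speedup(sequential, automatic):.2f}")
```

Both ratios come out close to 1.5, well short of the 2.0 one would hope for
on two cores.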
So why didn't we achieve this speedup? The first answer we usually think of
is that some part of the program wasn't parallelised, such as some startup
time (see Amdahl's law). I don't think that this is significant in this
program. In the comparison above we compared asm_fast.gc.par.stseg with
asm_fast.gc; the former is known to be slower for some programs, so
comparing these figures doesn't tell us if we got the best speedup that we
possibly could (2.0). However, if we compared asm_fast.gc.par.stseg with
one thread against asm_fast.gc.par.stseg with multiple threads, then that
wouldn't tell us how worthwhile parallelisation is. The point I'm making is
that the data you need depends on the question you're trying to answer, and
that the results above don't tell us whether the automatic or manual
parallelisation is poor, or the parallel execution is poor. To answer this
we usually need the following:
+ The known best sequential performance; this is usually asm_fast.gc.
+ The parallel performance: asm_fast.gc.par.stseg with multiple threads.
+ The performance with parallelism enabled but not used:
  asm_fast.gc.par.stseg with one thread. This gives us an idea of the
  costs/overheads of parallelism.
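
One way to combine these three measurements is sketched below. The function
and key names are my own illustration, not part of any Mercury tool, and the
numbers in the usage line are placeholders, not measurements.

```python
def analyse(t_seq, t_par1, t_par_n):
    """Split the overall result into overhead and scaling.

    t_seq   : best sequential time (usually asm_fast.gc)
    t_par1  : parallel grade run with one thread
    t_par_n : parallel grade run with multiple threads
    """
    return {
        # How much slower the parallel grade is before any parallelism is used.
        "overhead": t_par1 / t_seq,
        # How well the program scales within the parallel grade.
        "scaling": t_par1 / t_par_n,
        # What the user actually gains over the best sequential build.
        "speedup": t_seq / t_par_n,
    }

# Placeholder numbers purely to show the call; substitute real timings.
print(analyse(100.0, 110.0, 55.0))
```

Note that speedup equals scaling divided by overhead, so a disappointing
speedup can be traced to one factor or the other.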
You can control the number of threads via the -P argument in the
MERCURY_OPTIONS environment variable.
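
For example, a small timing harness could set the variable per run. This is
a sketch under assumptions: the helper names are hypothetical, it replaces
any MERCURY_OPTIONS value already set, and it assumes the command passed in
is a build in a parallel grade.

```python
import os
import subprocess
import time

def mercury_env(threads):
    """Copy of the current environment with MERCURY_OPTIONS set to '-P <threads>'.

    Assumption: any existing MERCURY_OPTIONS value is overwritten.
    """
    env = dict(os.environ)
    env["MERCURY_OPTIONS"] = f"-P {threads}"
    return env

def time_run(cmd, threads):
    """Run cmd with the given thread count and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=mercury_env(threads), check=True)
    return time.perf_counter() - start

def compare(cmd, thread_counts):
    """Time cmd once for each thread count; returns {threads: seconds}."""
    return {p: time_run(cmd, p) for p in thread_counts}

# Example call (not executed here): compare(["./euler_82"], [1, 2])
```

A single run per setting is noisy, so repeating each measurement a few times
is advisable.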
You shouldn't feel the need to do this, but if you do I'd be happy to take a
look at the results. I am pleased that the automatic parallelisation edged
out the manual parallelisation, even if it was by only 1 or 2% (I believe
this is the first case where there's been a difference and a parallel
speedup). This is especially good on a dual core machine which requires
less parallelism for full utilisation.
Yes, I'm ignoring hyperthreading. It does make a difference, but it's more
accurate to pretend it's not there than to count the threads as independent
processors and assume something like a maximum speedup of 4.0. On my four
processor hyper-threading system some programs get speedups of about 4.3.
Thanks for the update.