[m-rev.] Performance of thread pooling on Java backend
Paul Bone
paul at bone.id.au
Fri Oct 3 19:30:25 AEST 2014
I've informally benchmarked the new thread pooling feature on the Java
backend and included some other results for comparison.
These benchmarks are informal: I used only ten runs of each test and I
_did not_ disable frequency scaling in my computer's BIOS, so the
processor's default behaviour of increasing its clock speed when only a
single core is in use will have been active. My processor has 4 cores with
2 threads per core; Mercury (in the Java and low-level C grades) will use
8 threads, and the Java backend will add extra threads if necessary.
I used the mandelbrot benchmark (I'll post a patch soon as I've modified
it) in several modes:

  no           - no parallelism
  spawn_native - use thread.spawn_native to create threads and thread.mvar
                 for synchronization
  spawn        - as above, but may use thread pools/contexts where
                 applicable
  conj         - independent parallel conjunctions
I tested three different grades: asm_fast.gc.par.stseg (shown as "llc"),
hlc.gc.par (shown as "hlc") and java. Times are in seconds; the ratios in
parentheses are speedup factors relative to no parallelism (which is not
an entirely fair comparison).

           no     spawn_native   spawn         conj
  llc    24.41    5.81 (4.20)    5.80 (4.21)   5.65 (4.32)
  hlc    13.09    2.75 (4.76)    N/A           N/A
  java   13.02    5.74 (2.27)    5.58 (2.33)   N/A
Wow, the low-level C grade is much slower than the others. We've noticed
that it can be slower in other benchmarks, but this may be the most
dramatic difference that I've observed. There are a number of potential
reasons why it is slower. One theory is that as C compilers have gotten
"smarter" we've had to add a number of volatile assembler barriers to
Mercury's generated C code to prevent them from breaking our code, which
can inhibit their optimisations generally.
Next, we notice that the java grade doesn't parallelise very well in
general. I've heard non-specific complaints about Java's parallel
performance, but it's never been clear to me what _aspect_ of it was the
problem. In this case, however, the runtimes are small and part of that
runtime is spent starting up the VM, which at least historically has been
slow. Therefore these results may be affected by Amdahl's law: the
parallelisable part of the program does not include the VM startup time.
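To illustrate the point, here's a small sketch of the Amdahl's law bound.
The serial fraction used below (roughly 1 second of VM startup out of the
13.02s sequential java time) is a made-up assumption for illustration, not
a measured figure:

```java
// Sketch of Amdahl's law: with serial fraction f, the best possible
// speedup on n threads is 1 / (f + (1 - f) / n).
public class Amdahl {
    // Speedup predicted by Amdahl's law for serial fraction f on n threads.
    static double speedup(double f, int n) {
        return 1.0 / (f + (1.0 - f) / n);
    }

    public static void main(String[] args) {
        // Hypothetical: 1s of the 13.02s sequential run is serial startup.
        double f = 1.0 / 13.02;
        System.out.printf("predicted best speedup on 8 threads: %.2f%n",
            speedup(f, 8));
    }
}
```

Even under that assumption the bound (about 5.2x) is well above the 2.3x
we observed, so startup time alone probably doesn't explain the whole gap.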
Finally, thread pooling does indeed make a noticeable difference; spawn is
about 3% faster than spawn_native for this benchmark. I hope that this
difference is even greater for finer-grained parallelism.
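For anyone unfamiliar with why pooling helps here, the following is a
plain-Java sketch (not Mercury's actual runtime code) contrasting the two
strategies: spawning a fresh OS thread per task, as spawn_native does,
versus submitting tasks to a fixed pool of reusable workers, which avoids
per-task thread creation and teardown cost:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {
    // spawn_native analogue: one fresh thread per task.
    // Returns the number of tasks that completed.
    static int runFresh(int tasks) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        Thread[] ts = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            ts[i] = new Thread(done::incrementAndGet);
            ts[i].start();
        }
        for (Thread t : ts) {
            t.join();
        }
        return done.get();
    }

    // spawn analogue: a fixed pool of workers sized to the hardware.
    // Returns the number of tasks that completed.
    static int runPooled(int tasks) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int tasks = 10_000;

        long t0 = System.nanoTime();
        runFresh(tasks);
        long fresh = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        runPooled(tasks);
        long pooled = System.nanoTime() - t1;

        System.out.printf("fresh threads: %d ms, pooled: %d ms%n",
            fresh / 1_000_000, pooled / 1_000_000);
    }
}
```

The smaller each task, the larger the fraction of wall time the fresh-thread
version spends on thread creation, which is why I expect pooling to matter
more for finer-grained parallelism.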
Also, in hindsight, these runtimes are rather short so there's a risk that
the results may be less accurate, for example due to Java's startup time and
Amdahl's law. I'll see if I can do some more testing with a larger image
and with frequency scaling disabled.
Thanks.
--
Paul Bone