[mercury-users] Threaded processing slower than sequential?
Ondrej Bojar
obo at cuni.cz
Tue Mar 7 05:19:39 AEDT 2006
Dear all.
I'm using hlc.gc.par grade and the concurrency library (rotd-2006-03-01)
to perform parallel_map_fold (the fold is used on the io__state only).
The problem is that the threaded processing is either *much slower* than
a simple map_foldl or simply never finishes when launched on reasonable
input.
Every member of the list is mapped using the supplied predicate in a
separate thread. Note that there are quite few elements in the list (6
or 8) and I use a 4 CPU 64bit machine so I do not think the kernel is
wasting all the time.
If the threads print out some debugging info to stderr, it slowly
The threaded processing works fine for processor-only computations such
as the exponential implementation of Fibonacci numbers:
fibonacci threaded:
real 1m56.867s
user 4m24.006s
sys 0m0.021s
fibonacci sequential:
real 4m23.973s
user 4m23.960s
sys 0m0.014s
(See how the user time nicely matches in sequential and parallel mode.)
sequential processing (CPU+"structure" intensive)
real 2m28.318s
user 2m24.476s
sys 0m3.747s
By "structure intensive" I mean quite a lot of little data blocks, in
total not more than 0.2% of the machine's 10GB are used.
The parallel version (just replaced map_foldl with parallel_map_fold)
needed 38 minutes and produced no output to stdout (on the contrary to
the sequential version).
real 38m25.459s
user 33m26.912s
sys 39m20.754s
(Another task, also RAM intensive, which took about 7 hours sequential
did not finish after 4 days when threaded.)
I'm also surprised that not all the processors are used: with 4 threads
of a process (and nothing else running on the machine), each of the
threads seems to take only 50 % of a CPU, although there are 4
processors in total.
22731 bojar 16 0 3929m 16m 1112 S 49.3 0.2 1:01.68 suggest_for
22733 bojar 16 0 3929m 16m 1112 R 47.9 0.2 1:01.65 suggest_for
22732 bojar 16 0 3929m 16m 1112 S 47.3 0.2 1:00.40 suggest_for
22730 bojar 16 0 3929m 16m 1112 R 45.9 0.2 0:59.01 suggest_for
(Process state for the threads oscillates between S and R.)
The fibonacci code behaves nicely and all once there are 4 threads only,
each takes 99%.
I'm not very experienced with threaded programming, so I do not know
where should I look for the reason.
I'm afraid I even won't be able to profile the program, since there is
no hlc.gc.par.prof grade. (Is it possible to ask for this grade during
installation, or would it be ignored anyway?)
Could it be the kernel?
2.6.12-1.1381_FC3smp #1 SMP Fri Oct 21 04:22:48 EDT 2005 x86_64 x86_64
x86_64 GNU/Linux
Thanks again for any kind of hint,
Ondrej.
--
Ondrej Bojar (mailto:obo at cuni.cz)
http://www.cuni.cz/~obo
--------------------------------------------------------------------------
mercury-users mailing list
post: mercury-users at cs.mu.oz.au
administrative address: owner-mercury-users at cs.mu.oz.au
unsubscribe: Address: mercury-users-request at cs.mu.oz.au Message: unsubscribe
subscribe: Address: mercury-users-request at cs.mu.oz.au Message: subscribe
--------------------------------------------------------------------------
More information about the users
mailing list