[mercury-users] Threaded processing slower than sequential?

Ondrej Bojar obo at cuni.cz
Tue Mar 7 05:19:39 AEDT 2006


Dear all.

I'm using hlc.gc.par grade and the concurrency library (rotd-2006-03-01) 
to perform parallel_map_fold (the fold is used on the io__state only).
The problem is that the threaded processing is either *much slower* than 
a simple map_foldl or simply never finishes when launched on reasonable 
input.

Every member of the list is mapped using the supplied predicate in a 
separate thread. Note that there are quite few elements in the list (6 
or 8) and I use a 4 CPU 64bit machine so I do not think the kernel is 
wasting all the time.

If the threads print out some debugging info to stderr, it slowly


The threaded processing works fine for processor-only computations such 
as the exponential implementation of Fibonacci numbers:

fibonacci threaded:
real    1m56.867s
user    4m24.006s
sys     0m0.021s

fibonacci sequential:
real    4m23.973s
user    4m23.960s
sys     0m0.014s

(See how the user time nicely matches in sequential and parallel mode.)

sequential processing (CPU+"structure" intensive)
real    2m28.318s
user    2m24.476s
sys     0m3.747s

By "structure intensive" I mean quite a lot of little data blocks, in 
total not more than 0.2% of the machine's 10GB are used.

The parallel version (just replaced map_foldl with parallel_map_fold) 
needed 38 minutes and produced no output to stdout (on the contrary to 
the sequential version).

real    38m25.459s
user    33m26.912s
sys     39m20.754s

(Another task, also RAM intensive, which took about 7 hours sequential 
did not finish after 4 days when threaded.)


I'm also surprised that not all the processors are used: with 4 threads 
of a process (and nothing else running on the machine), each of the 
threads seems to take only 50 % of a CPU, although there are 4 
processors in total.

22731 bojar     16   0 3929m  16m 1112 S 49.3  0.2   1:01.68 suggest_for
22733 bojar     16   0 3929m  16m 1112 R 47.9  0.2   1:01.65 suggest_for
22732 bojar     16   0 3929m  16m 1112 S 47.3  0.2   1:00.40 suggest_for
22730 bojar     16   0 3929m  16m 1112 R 45.9  0.2   0:59.01 suggest_for

(Process state for the threads oscillates between S and R.)

The fibonacci code behaves nicely and all once there are 4 threads only, 
each takes 99%.

I'm not very experienced with threaded programming, so I do not know 
where should I look for the reason.

I'm afraid I even won't be able to profile the program, since there is 
no hlc.gc.par.prof grade. (Is it possible to ask for this grade during 
installation, or would it be ignored anyway?)

Could it be the kernel?
2.6.12-1.1381_FC3smp #1 SMP Fri Oct 21 04:22:48 EDT 2005 x86_64 x86_64 
x86_64 GNU/Linux

Thanks again for any kind of hint,
   Ondrej.

-- 
Ondrej Bojar (mailto:obo at cuni.cz)
http://www.cuni.cz/~obo
--------------------------------------------------------------------------
mercury-users mailing list
post:  mercury-users at cs.mu.oz.au
administrative address: owner-mercury-users at cs.mu.oz.au
unsubscribe: Address: mercury-users-request at cs.mu.oz.au Message: unsubscribe
subscribe:   Address: mercury-users-request at cs.mu.oz.au Message: subscribe
--------------------------------------------------------------------------



More information about the users mailing list