[mercury-users] Threads severely hurt performance

Ondrej Bojar obo at cuni.cz
Sat Apr 8 01:55:38 AEST 2006


Hi.

Just to let you know: it is possible to compile Mercury with a garbage 
collector that supports THREAD_LOCAL_ALLOC by setting:

         CFLAGS_FOR_THREADS="-DMR_THREAD_SAFE -DLINUX_THREADS \
             -D_THREAD_SAFE -D_REENTRANT -DPARALLEL_MARK \
             -DTHREAD_LOCAL_ALLOC -DGENERIC_COMPARE_AND_SWAP"

in mercury/configure and commenting out the restriction to I386 (I need 
x64) in /boehm_gc/include/private/gc_locks.h:
416a417
 > #     if defined(I386)
436a438
 > #     endif /* I386 */
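
For readers wondering what GENERIC_COMPARE_AND_SWAP implies: the sketch
below shows the general shape of a lock-based compare-and-swap fallback.
It is an illustration only, not the collector's actual code. The names
sketch_cas_lock and sketch_compare_and_exchange are made up here, and the
real definition should be checked in the Boehm GC sources. The point is
that every call goes through one global mutex, so with -DPARALLEL_MARK it
may itself become a source of contention.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical names, for illustration only (not Boehm GC identifiers). */
static pthread_mutex_t sketch_cas_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Replace *addr with new_val if it still holds old_val.
 * The operation is atomic only with respect to other callers of this
 * function, because all of them serialize on the same mutex.
 * Returns 1 if the swap happened, 0 otherwise.
 */
int sketch_compare_and_exchange(volatile size_t *addr,
                                size_t old_val, size_t new_val)
{
    int swapped = 0;

    pthread_mutex_lock(&sketch_cas_lock);
    if (*addr == old_val) {
        *addr = new_val;
        swapped = 1;
    }
    pthread_mutex_unlock(&sketch_cas_lock);
    return swapped;
}
```

A hardware compare-and-swap instruction avoids the mutex entirely, which
is presumably why the header restricts the fast path to specific
architectures.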




The results are by no means encouraging:

sequential processing:
real    0m0.852s
user    0m1.027s
sys     0m0.025s

threaded processing:
real    0m54.436s
user    0m46.113s
sys     0m59.877s

In other words: parallel processing in 4 threads on a 4-CPU machine 
takes over 60 times longer than sequential processing!
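
Spelling the ratios out from the two `time` outputs above: the small C
sketch below only redoes the arithmetic. The struct and function names are
made up for illustration. Reading the large sys share as a sign of
kernel-level synchronization is an assumption, not something measured.

```c
#include <stdio.h>

/* Timings in seconds, copied from the `time` output above. */
struct timing { double real, user, sys; };

static const struct timing sequential = {  0.852,  1.027,  0.025 };
static const struct timing threaded   = { 54.436, 46.113, 59.877 };

/* Wall-clock slowdown of run b relative to run a. */
double real_slowdown(struct timing a, struct timing b)
{
    return b.real / a.real;
}

/* Slowdown in total CPU time (user + sys) of run b relative to run a. */
double cpu_slowdown(struct timing a, struct timing b)
{
    return (b.user + b.sys) / (a.user + a.sys);
}

/* Fraction of a run's CPU time that was spent in the kernel. */
double sys_fraction(struct timing t)
{
    return t.sys / (t.user + t.sys);
}

/* Print the three derived figures for the runs above. */
void print_summary(void)
{
    printf("wall-clock slowdown: %.1fx\n",
           real_slowdown(sequential, threaded));
    printf("CPU-time slowdown:   %.1fx\n",
           cpu_slowdown(sequential, threaded));
    printf("sys share, threaded: %.0f%%\n",
           100.0 * sys_fraction(threaded));
}
```

The wall-clock ratio comes out at about 64, the CPU-time ratio at about
100, and more than half of the threaded run's CPU time is sys time.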

The test code was this:

main(!IO) :-
   List = [1,2,3,4],
   io__write_string("InList: ", !IO),
   io__write(List, !IO),
   io__nl(!IO),

   % threaded__list_map_io(
        % threaded__list_map_io runs the mapping predicate on all list
        % elements simultaneously and collects the results;
        % each of the threads has access to io__state
        % as concurrency/spawn.m supports it
   list__map_foldl(
     (pred(In::in, Out::out, !.I::di, !:I::uo) is det:-
       dump("Evaluating: ", In, !I),
       {Out,OutL} = fib(In+30),
       dump("Evaluated: ", {In, Out}, !I),
       true
     ), List, OutList, !IO),
   io__write_string("Final result: ", !IO),
   io__write(OutList, !IO),
   io__nl(!IO),
   true.

:- func fib(int) = {int, list(int)}.
fib(I) = Out :-
   if I =< 2
   then Out = {1, [1]}
   else
     { A, AL } = fib(I-2),
     { B, BL } = fib(I-1),
     Out = {A+B, AL}.

:- pred dump(string::in, T::in, io::di, io::uo) is det.
dump(Msg, V, !IO) :-
   io__write_string(Msg, !IO),
   io__write(V, !IO),
   io__nl(!IO).


The aim of the test code is to (dumbly) calculate Fibonacci numbers. In 
each thread, a dummy list is constructed just to simulate some 
operations on the heap. If the code did not touch the heap at all, 
parallel processing would indeed save some time.

The reason for the terrible slowdown is most probably that the garbage 
collector stops all the threads whenever it is about to scan the heap.
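
The other explanation in this thread, Peter Ross's (quoted below), is that
a lock around every allocation serializes allocation-heavy code, and that
per-thread free lists avoid this. A minimal sketch of that idea follows.
It is not Boehm GC code: all names (shared_alloc, local_alloc, BATCH, ...)
are invented for illustration, and the "free list" is just a counter. It
only counts how often the global lock is taken.

```c
#include <pthread.h>
#include <stddef.h>

#define BATCH 64  /* cells handed to a thread's private list per refill */

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t lock_acquisitions = 0;  /* updated only under alloc_lock */
static size_t next_cell = 0;          /* stand-in for a shared free list */

/* Global-lock scheme: every single allocation takes alloc_lock. */
size_t shared_alloc(void)
{
    size_t cell;

    pthread_mutex_lock(&alloc_lock);
    lock_acquisitions++;
    cell = next_cell++;
    pthread_mutex_unlock(&alloc_lock);
    return cell;
}

/* Per-thread state: a private run of cells reserved from the shared pool. */
static _Thread_local size_t local_next = 0;
static _Thread_local size_t local_left = 0;

/* Thread-local scheme: take alloc_lock only to refill the private run. */
size_t local_alloc(void)
{
    if (local_left == 0) {
        pthread_mutex_lock(&alloc_lock);
        lock_acquisitions++;
        local_next = next_cell;
        next_cell += BATCH;
        pthread_mutex_unlock(&alloc_lock);
        local_left = BATCH;
    }
    local_left--;
    return local_next++;
}

/* Read the acquisition counter consistently. */
static size_t read_acquisitions(void)
{
    size_t n;

    pthread_mutex_lock(&alloc_lock);
    n = lock_acquisitions;
    pthread_mutex_unlock(&alloc_lock);
    return n;
}

/* How many times the calling thread takes the lock for n allocations. */
size_t locks_needed(size_t (*alloc)(void), size_t n)
{
    size_t before = read_acquisitions();
    size_t i;

    for (i = 0; i < n; i++)
        alloc();
    return read_acquisitions() - before;
}
```

Called from a thread whose private run is still empty, 1000 allocations
take the lock 1000 times through shared_alloc but only 16 times through
local_alloc with BATCH set to 64. This says nothing about the cost of the
collections themselves, which stop all threads in either scheme.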

Sigh.

Ondrej.


Peter Ross wrote:
> On Tue, Mar 21, 2006 at 09:31:53PM +0100, Ondrej Bojar wrote:
> 
>>Hi.
>>
>>I'm still hoping to use threads in my computations, but as I posted
>>recently and Peter Ross suggested, the Boehm garbage collector wastes
>>far more time than parallel processing can save. After reading some of
>>the documentation, I believe that the gc needs to stop all the threads
>>too often to perform collection.
>>
>>Boehm gc supports some extra flags to improve performance:
>>-DPARALLEL_MARK for parallel marking and -DTHREAD_LOCAL_ALLOC for
>>separate collectors for individual threads. (Parallel mark should help 
>>straight away, thread local needs to be actually used in mmc.)
>>
> 
> My personal feeling, not having any evidence to back it up, is that
> it's allocation not garbage collection which is slowing the system down.
> I believe that the GC needs a lock to do allocation and because Mercury
> is so allocation intensive, we effectively serialize the code, thus
> for me getting THREAD_LOCAL_ALLOC is what is important.
> 
> In my reading of the documentation only Linux supports
> THREAD_LOCAL_ALLOC, but it doesn't require any support from the
> compiler.  All it does is give each thread its own free list to
> allocate from, thus avoiding contention on the free list.
> 
> So I would concentrate on getting THREAD_LOCAL_ALLOC working.
> 
> 
>>I tried to add these options to rotd/configure into
>>CFLAGS_FOR_THREADS for my platform (linux, x64), but mmc compiled with
>>these extra flags fails to link my programs:
>>
>>/home/bojar/opt/mercury-compiler-rotd-2006-03-18-x64-gcthread/lib/mercury/lib/libpar_gc.a(alloc.o)(.text+0x5a5): In function `GC_set_fl_marks':
>>: undefined reference to `GC_compare_and_exchange'
>>/home/bojar/opt/mercury-compiler-rotd-2006-03-18-x64-gcthread/lib/mercury/lib/libpar_gc.a(mark.o)(.text+0xeb): In function `GC_set_mark_bit':
>>: undefined reference to `GC_compare_and_exchange'
>>...
>>
>>As far as I understand, boehm_gc/configure (which would support option 
>>--enable-parallel-mark) is not called from generic configure, so I 
>>cannot check if it would work better.
>>
>>Would you recommend any workaround?
>>
> 
> From the code, it looks like you need to define GENERIC_COMPARE_AND_SWAP;
> see pthreads_support.c and search for GC_compare_and_exchange.
> 
> Pete
> --------------------------------------------------------------------------
> mercury-users mailing list
> post:  mercury-users at cs.mu.oz.au
> administrative address: owner-mercury-users at cs.mu.oz.au
> unsubscribe: Address: mercury-users-request at cs.mu.oz.au Message: unsubscribe
> subscribe:   Address: mercury-users-request at cs.mu.oz.au Message: subscribe
> --------------------------------------------------------------------------
> 

-- 
Ondrej Bojar (mailto:obo at cuni.cz)
http://www.cuni.cz/~obo

