[m-dev.] RBMM & CTGC
Marcus Troyka
aeonioshaplo at gmail.com
Sat Apr 21 09:15:00 AEST 2012
I've been reading over the theses related to memory management
recently. I've read them all before, although previously I generally
only ended up with a headache. This time, however, I was able to
extract some interesting details that the original developers seem to
have missed, and I thought I'd share them here for anyone interested.
RBMM:
1. It seems to me that all the fancy liveness/structure-sharing
analysis work done for CTGC could rather easily be carried over and
reused for RBMM. Considering that it already handles multiple modules
and backwards liveness, I'd say that's a lot of work saved.
2. I found a few fairly significant errors in Quan Phan's thesis, in
the benchmarks section. The first error involves the following
entries:
             Heap   RB-macro   RB-func   RB-func-Rtracking
boyer:      14.86      14.57     19.83               19.85
crypt:      16.70      14.66     15.97               16.11
healthy:     7.97      10.73     16.30                6.46
life:       11.58      12.13     15.21               10.54
queens:     19.47      15.94     19.70               19.92

(runtime in seconds)
In his thesis, Quan attributed the poor performance of the boyer
benchmark (and cases like it) to the overhead of saving region-
tracking information. However, in the run-time table you can clearly
see that the difference in runtime for boyer between the plain-
function and region-tracking-function versions is only 0.02 seconds,
while the difference between the macro versions and the function
versions is about 5 seconds, more than two orders of magnitude
greater. The same pattern holds across all of the examples: wherever
the region-tracking or macro versions are slower, the difference
relative to plain functions is negligible. Sometimes region tracking
or macros are faster, sometimes both are, but neither is ever
significantly slower, and each is dramatically faster in some cases.
For life and healthy, both the macro and region-tracking versions are
dramatically faster than the plain-function versions.
Of course, the main argument against using macros is that they can as
much as double the file size. However, the file-size savings gained
by region tracking should apply even more strongly to macros, since
macro bodies are copied extensively, so the spread between the macro
and function versions should become much more reasonable with region
tracking.
My guess as to the reason for this performance result is that, unlike
the Boehm collector, which is called relatively infrequently and does
a great deal of expensive work per call, the RM functions are called
very often and do so little work that the cost of the function call
itself is significant, which is exactly what the macro versions
avoid. I consider it a promising result, however.
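To illustrate the trade-off, here is a minimal C sketch of a bump-
pointer region allocation written both as a function and as a macro.
The names and layout are my own assumptions for illustration, not the
actual Mercury runtime:

    #include <stddef.h>

    /* Hypothetical region descriptor: an allocation is just a pointer
     * bump plus a bounds check, so a call/return can cost as much as
     * the allocation itself. */
    struct region {
        char *next;   /* next free byte in the current page */
        char *limit;  /* end of the current page            */
    };

    void *region_alloc_func(struct region *r, size_t nbytes)
    {
        if (r->next + nbytes > r->limit)
            return NULL;              /* caller must fetch a new page */
        void *p = r->next;
        r->next += nbytes;
        return p;
    }

    /* The macro version does the same work with no call overhead, but
     * its body is copied at every allocation site, which is what
     * inflates the file size. */
    #define REGION_ALLOC(r, nbytes, out)                               \
        do {                                                           \
            if ((r)->next + (nbytes) > (r)->limit) (out) = NULL;       \
            else { (out) = (void *)(r)->next; (r)->next += (nbytes); } \
        } while (0)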
3. The other error involves the memory usage statistics presented.
While they are generally accurate, there are several cases where the
reported statistics are misleading.

rqueens: max regions: 2;   max size: 48 words (theoretical); SLR: 24
rlife:   max regions: 102; max size: 1856 words;             SLR: 1722
If each region page is 2048 words in size, then with two regions
queens cannot be running with less than 4096 words of heap, even if
it only uses 48 of those. Life is even worse: at 102 regions it has a
minimum heap size of 208,896 words, of which it uses only 1856. While
these figures are still very favorable compared to the Boehm
collector, they demonstrate the potential space leak caused by "slack
space" in small regions. Requesting 100 of these pages at a time also
limits the minimum heap size of any program making at least one
expansion to 204,800 words, which is still far from optimal.
MLKit (as reported by Makholm, 2000) deals with the former problem by
determining at compile time which regions are known to be smaller
than the minimum page size, and backing those regions with
"mini-pages" of perhaps 128-256 words. Makholm used 16-word pages by
default, but this resulted in significant fragmentation (and
overhead), which larger page sizes avoid. I would suggest a default
page size of 1024 words, since that's what virtually all modern MMUs
will return.
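As a sketch of how that compile-time choice might look (the names and
sizes here are assumptions for illustration, not MLKit's or Mercury's
actual code):

    #include <stddef.h>

    #define MINI_PAGE_WORDS  128   /* for regions statically known small */
    #define PAGE_WORDS      1024   /* suggested default page size        */

    /* Pick a page class for a region whose size bound is known at
     * compile time; bound_words == 0 means "no static bound". */
    size_t page_words_for_region(size_t bound_words)
    {
        if (bound_words != 0 && bound_words <= MINI_PAGE_WORDS)
            return MINI_PAGE_WORDS;   /* mini-page: avoids slack space */
        return PAGE_WORDS;
    }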
The latter problem might be solved by using exponentially growing
memory requests. Based on the collected data, I would say that (given
1024-word pages) starting with 4 pages, then 32, 128, 256, and 512
pages on each subsequent memory request would probably keep the
number of requests to a minimum without bloating the heap
unnecessarily. If a region is known at compile time to be larger than
a certain size, the first request could be bumped up to the next
largest step. Also, 4096-word pages could be used for larger regions
known at compile time.
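A minimal sketch of such a request schedule (the function and names
are assumptions for illustration, not existing runtime code):

    #include <stddef.h>

    /* Suggested growth schedule, in 1024-word pages: 4 pages first,
     * then 32, 128, 256, and 512 on each subsequent request. */
    static const size_t page_schedule[] = { 4, 32, 128, 256, 512 };
    #define SCHEDULE_STEPS \
        (sizeof page_schedule / sizeof page_schedule[0])

    /* Pages to request on the nth expansion (0-based); once the
     * schedule is exhausted, keep requesting the largest step. */
    size_t pages_for_expansion(size_t n)
    {
        return page_schedule[n < SCHEDULE_STEPS ? n : SCHEDULE_STEPS - 1];
    }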
4. Stack allocation of regions is probably not a good idea. There are
too many ways for it to interfere with other things allocated on the
stack, and the stack behaves differently between parallel and
non-parallel environments and between architectures. Region merging
and static size analysis should provide the same performance
improvement anyway.
CTGC vs RBMM:
After comparing the papers on RBMM and CTGC numerous times, I've
finally come to some reliable conclusions about the interactions and
relationship between the two. Contrary to Quan, I don't see them as
differing in their level of granularity, although at first glance
that does seem to be the case. Rather, the granularity of the two is
identical: both handle memory assignment at the level of individual
cells, both use the exact same (basic) analysis information, and both
are trying to reuse the same memory. The difference between them is
the definition of locality implied: CTGC tries to conserve spatial
locality, whereas RBMM tries to conserve temporal locality.
What this means is that if you were to hypothetically rewrite CTGC to
reuse only memory cells allocated into the same region, CTGC would
produce no local reuse, because of the opposite behavior of the two.
In my paper-digging efforts, I found only three benchmarks for which
data is available for both CTGC and RBMM: nrev, qsort, and queens.
What I found, although tentative, was interesting. For nrev and
qsort, RBMM and CTGC produce identical memory usage. For queens,
however, CTGC was only able to reduce memory usage by about 4%,
whereas RBMM produces nearly optimal memory usage even without region
reuse, and fully optimal usage with reuse (over 99% in both cases).
While I don't have nearly enough data (or the theoretical background)
to prove it conclusively, my suspicion is that RBMM will always
perform equal to or better than CTGC (assuming region reuse and the
above considerations) in terms of memory usage.
On the other hand, while memory usage was identical for CTGC and RBMM
on nrev and qsort, runtime performance was not: for nrev, CTGC
performed clearly better (79% vs 57%), whereas for qsort, RBMM
performed clearly better (42% vs 26%).
Nancy mentioned in her thesis that the "graph" selection method could
also be tuned with runtime, rather than total memory reused, as its
weight criterion, as outlined by Debray. That will probably be the
best direction for CTGC, using the overheads of RBMM as the cost
measure. It may also (hopefully) simplify the version-generation
problem, for which file size is also an important consideration when
combining with RBMM. It's not clear that things will be so
straightforward, but given how complicated the interaction between
the two is, a performance study would probably be more revealing than
trying to resolve it theoretically.
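To make that concrete, here is a minimal C sketch of what a runtime-
weighted selection criterion might look like. The struct fields and
the cost model are my own assumptions for illustration, not Nancy's
actual implementation or the Mercury compiler's internals:

    #include <stddef.h>

    /* Hypothetical description of one reuse opportunity. */
    struct reuse_candidate {
        size_t cells_reused;    /* weight used by the memory criterion */
        double calls_per_run;   /* estimated execution frequency       */
        double alloc_cost;      /* cost of each fresh allocation saved */
        double reuse_overhead;  /* cost added by the reuse code, e.g.  */
                                /* RBMM region-tracking overheads      */
    };

    /* Memory-based weight, as in the original "graph" method. */
    double memory_weight(const struct reuse_candidate *c)
    {
        return (double) c->cells_reused;
    }

    /* Runtime-based weight: net time saved per run, charging the
     * reuse overheads against the allocations avoided. */
    double runtime_weight(const struct reuse_candidate *c)
    {
        return c->calls_per_run * (c->alloc_cost - c->reuse_overhead);
    }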
Related questions:
What has/hasn't been implemented in RBMM currently?
-Marc