[m-dev.] RBMM & CTGC
Marcus Troyka
aeonioshaplo at gmail.com
Sat Apr 21 09:15:00 AEST 2012
I've been reading over the theses related to memory management
recently. I've read them all before, although previously I generally
only ended up with a headache. This time, however, I was able to
extract some interesting details that the original developers seem to
have missed, and I thought I'd share them here for anyone interested.
RBMM:
1. It seems to me that all the fancy liveness/structure-sharing
analysis work done for CTGC could rather easily be carried over and
reused for RBMM. Considering that it already handles multiple modules
and backwards liveness, I'd say that's a lot of work saved.
2. I found a few fairly significant errors in Quan Phan's thesis, in
the benchmarks section. The first error involves the following
entries:
             Heap   RB-macro   RB-func   RB-func-Rtracking
boyer:      14.86      14.57     19.83               19.85
crypt:      16.70      14.66     15.97               16.11
healthy:     7.97      10.73     16.30                6.46
life:       11.58      12.13     15.21               10.54
queens:     19.47      15.94     19.70               19.92

(runtime in seconds)
In his thesis, Quan attributed the poor performance of the boyer
benchmark (and cases like it) to the overhead of saving region-
tracking information. However, in the run-time table you can clearly
see that the difference in runtime for boyer between the plain-
function and region-tracking-function versions is only 0.02 seconds,
while the difference between the macro versions and the function
versions is about 5 seconds, more than two orders of magnitude
greater. The same pattern holds across all of the examples: wherever
the region-tracking or macro versions are slower, the difference
relative to plain functions is negligible. Sometimes region tracking
or macros are faster, sometimes both are, but neither is ever
significantly slower, and each is dramatically faster in some cases.
For life and healthy, both the macro and region-tracking versions are
dramatically faster than the plain-function versions.
Of course, the main argument against using macros is that they can as
much as double the file size. However, the file-size savings gained
by region tracking should apply even more strongly to macros, since
macro bodies are copied extensively, so the spread between the macro
and function versions should become much more reasonable with region
tracking.
My guess as to the reason for this performance result is that, unlike
the Boehm collector, which is called relatively infrequently and does
a great deal of expensive work per call, the RM functions are called
very often and do so little work that the cost of the function call
itself is significant, which is exactly what the macro versions
avoid. I consider it a promising result, however.
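To illustrate the trade-off, here is a minimal C sketch of a bump-
pointer region allocation written both as a function and as a macro.
The names and layout are my own assumptions for illustration, not the
actual Mercury runtime:

    #include <stddef.h>

    /* Hypothetical region descriptor: an allocation is just a pointer
     * bump plus a bounds check, so a call/return can cost as much as
     * the allocation itself. */
    struct region {
        char *next;   /* next free byte in the current page */
        char *limit;  /* end of the current page            */
    };

    void *region_alloc_func(struct region *r, size_t nbytes)
    {
        if (r->next + nbytes > r->limit)
            return NULL;              /* caller must fetch a new page */
        void *p = r->next;
        r->next += nbytes;
        return p;
    }

    /* The macro version does the same work with no call overhead, but
     * its body is copied at every allocation site, which is what
     * inflates the file size. */
    #define REGION_ALLOC(r, nbytes, out)                               \
        do {                                                           \
            if ((r)->next + (nbytes) > (r)->limit) (out) = NULL;       \
            else { (out) = (void *)(r)->next; (r)->next += (nbytes); } \
        } while (0)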
3. The other error involves the memory usage statistics presented.
While they are generally accurate, there are several cases where the
reported statistics are misleading.

rqueens: max regions: 2;   max size: 48 words (theoretical); SLR: 24
rlife:   max regions: 102; max size: 1856 words;             SLR: 1722
If each region page is 2048 words in size, then with two regions
queens cannot be running with less than 4096 words of heap, even if
it only uses 48 of those. Life is even worse: at 102 regions it has a
minimum heap size of 208,896 words, of which it uses only 1856. While
these figures are still very favorable compared to the Boehm
collector, they demonstrate the potential space leak caused by "slack
space" in small regions. Requesting 100 of these pages at a time also
limits the minimum heap size of any program making at least one
expansion to 204,800 words, which is still far from optimal.
MLKit (as reported by Makholm, 2000) deals with the former problem by
determining at compile time which regions are known to be smaller
than the minimum page size, and backing those regions with
"mini-pages" of perhaps 128-256 words. Makholm used 16-word pages by
default, but this resulted in significant fragmentation (and
overhead), which larger page sizes avoid. I would suggest a default
page size of 1024 words, since that's what virtually all modern MMUs
will return.
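As a sketch of how that compile-time choice might look (the names and
sizes here are assumptions for illustration, not MLKit's or Mercury's
actual code):

    #include <stddef.h>

    #define MINI_PAGE_WORDS  128   /* for regions statically known small */
    #define PAGE_WORDS      1024   /* suggested default page size        */

    /* Pick a page class for a region whose size bound is known at
     * compile time; bound_words == 0 means "no static bound". */
    size_t page_words_for_region(size_t bound_words)
    {
        if (bound_words != 0 && bound_words <= MINI_PAGE_WORDS)
            return MINI_PAGE_WORDS;   /* mini-page: avoids slack space */
        return PAGE_WORDS;
    }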
The latter problem might be solved by using exponentially growing
memory requests. Based on the collected data, I would say that (given
1024-word pages) starting with 4 pages, then 32, 128, 256, and 512
pages on each subsequent memory request would probably keep the
number of requests to a minimum without bloating the heap
unnecessarily. If a region is known at compile time to be larger than
a certain size, the first request could be bumped up to the next
largest step. Also, 4096-word pages could be used for larger regions
known at compile time.
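A minimal sketch of such a request schedule (the function and names
are assumptions for illustration, not existing runtime code):

    #include <stddef.h>

    /* Suggested growth schedule, in 1024-word pages: 4 pages first,
     * then 32, 128, 256, and 512 on each subsequent request. */
    static const size_t page_schedule[] = { 4, 32, 128, 256, 512 };
    #define SCHEDULE_STEPS \
        (sizeof page_schedule / sizeof page_schedule[0])

    /* Pages to request on the nth expansion (0-based); once the
     * schedule is exhausted, keep requesting the largest step. */
    size_t pages_for_expansion(size_t n)
    {
        return page_schedule[n < SCHEDULE_STEPS ? n : SCHEDULE_STEPS - 1];
    }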
4. Stack allocation of regions is probably not a good idea. There are
too many ways for it to interfere with other things allocated on the
stack, and the stack behaves differently between parallel and
non-parallel environments and between architectures. Region merging
and static size analysis should provide the same performance
improvement anyway.
CTGC vs RBMM:
After comparing the papers on RBMM and CTGC numerous times, I've
finally come to some reliable conclusions about the interactions and
relationship between the two. Contrary to Quan, I don't see them as
differing in their level of granularity, although at first glance
that does seem to be the case. Rather, the granularity of the two is
identical: both handle memory assignment at the level of individual
cells, both use the exact same (basic) analysis information, and both
are trying to reuse the same memory. The difference between them is
the definition of locality implied: CTGC tries to conserve spatial
locality, whereas RBMM tries to conserve temporal locality.
What this means is that if you were to hypothetically rewrite CTGC to
reuse only memory cells allocated into the same region, CTGC would
produce no local reuse, because of the opposite behavior of the two.
In my paper-digging efforts, I found only three benchmarks for which
data is available for both CTGC and RBMM: nrev, qsort, and queens.
What I found, although tentative, was interesting. For nrev and
qsort, RBMM and CTGC produce identical memory usage. For queens,
however, CTGC was only able to reduce memory usage by about 4%,
whereas RBMM produces nearly optimal memory usage even without region
reuse, and fully optimal usage with reuse (over 99% in both cases).
While I don't have nearly enough data (or the theoretical background)
to prove it conclusively, my suspicion is that RBMM will always
perform equal to or better than CTGC (assuming region reuse and the
above considerations) in terms of memory usage.
On the other hand, while memory usage was identical for CTGC and RBMM
on nrev and qsort, runtime performance was not: for nrev, CTGC
performed clearly better (79% vs 57%), whereas for qsort, RBMM
performed clearly better (42% vs 26%).
Nancy mentioned in her thesis that the "graph" selection method could
also be tuned with runtime, rather than total memory reused, as its
weight criterion, as outlined by Debray. That will probably be the
best direction for CTGC, using the overheads of RBMM as the cost
measure. It may also (hopefully) simplify the version-generation
problem, for which file size is also an important consideration when
combining with RBMM. It's not clear that things will be so
straightforward, but given how complicated the interaction between
the two is, a performance study would probably be more revealing than
trying to resolve it theoretically.
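To make that concrete, here is a minimal C sketch of what a runtime-
weighted selection criterion might look like. The struct fields and
the cost model are my own assumptions for illustration, not Nancy's
actual implementation or the Mercury compiler's internals:

    #include <stddef.h>

    /* Hypothetical description of one reuse opportunity. */
    struct reuse_candidate {
        size_t cells_reused;    /* weight used by the memory criterion */
        double calls_per_run;   /* estimated execution frequency       */
        double alloc_cost;      /* cost of each fresh allocation saved */
        double reuse_overhead;  /* cost added by the reuse code, e.g.  */
                                /* RBMM region-tracking overheads      */
    };

    /* Memory-based weight, as in the original "graph" method. */
    double memory_weight(const struct reuse_candidate *c)
    {
        return (double) c->cells_reused;
    }

    /* Runtime-based weight: net time saved per run, charging the
     * reuse overheads against the allocations avoided. */
    double runtime_weight(const struct reuse_candidate *c)
    {
        return c->calls_per_run * (c->alloc_cost - c->reuse_overhead);
    }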
Related questions:
What has/hasn't been implemented in RBMM currently?
-Marc