[m-dev.] Mercury and GCC split stacks

Zoltan Somogyi zoltan.somogyi at runbox.com
Tue Sep 9 13:10:59 AEST 2014

On Tue, 9 Sep 2014 11:36:10 +1000, Paul Bone <paul at bone.id.au> wrote:
> I disagree that using mmap or mprotect outright prevents the use of
> superpages.  AIUI it's still be possible provided that we lay out the
> memory carefully.

How? The context of our discussion is 10,000 stacks, so with redzones,
you need 10,000 changeovers from "access permitted" to "access denied".
The maximum number of superpages supported is typically something like 4,
and on x86s and most other ISAs there can be no permission changes
inside a page, so this would get you only 4 such changeovers, however
carefully you laid out the stacks.

You could use superpages for 4 stacks out of 10,000, I guess,
but I wouldn't consider such a system "elegant". How would you
decide which stacks get the superpages?

(CAVEAT: there ARE some designs for virtual memory systems
that allow different parts of a page to have different permissions.
However, most working programmers have never seen one in real life,
and the whole point of superpages is to give the TLB information
in a format that is a MORE compact description of the address space,
not LESS.)

> I believe that currently our stseg grades work
> via an explicit check.

They do.

> I wonder if rather than compiling this check into the start of each
> procedure if we move it to a static code location, then for the whole
> program there is only one entry in the branch prediction tables.

But that would add other overheads, including the jump to this code,
and jump back. The former would be an absolute jump (at least without
-fpic) and not need a branch prediction, but the latter would need prediction;
not about whether the branch is taken (it will be), but about where
to branch back to. Such returns can be handled by the CPU branch
prediction unit, which typically has its own fixed-size miniature stack of
subroutine return addresses. Your proposal would effectively take away
one entry in this circuit from user level code.

See, after taking away low-hanging fruit, overhead is often like
toothpaste; you can squeeze it out here, but that only moves to there.

Since the asm_fast grades do not use call/return instructions
(they should all get compiled into plain branches, the kind intended
for use WITHIN subroutines), this may not be a big deal for us.

> This
> reduces pressure on these tables and means that for procedures that haven't
> yet been called the probable branch is already cached.

Actually, except at program startup, ALL branches are effectively cached,
even if you haven't executed them yet.

Confused? This happens because branch prediction caches do not
actually IDENTIFY the jump whose outcome they are predicting.
All they say is something like "a jump whose address has the last N bits
1001..1100 behaved this way the last time it was executed, so if you
see a  jump with the same last N bits, predict it to do the same".
(CAVEAT: I am STILL simplifying some things.) Since the table typically
contains something like 2k to 16k entries (last I checked), it gets filled
REALLY quickly. The reason why it works is twofold: most of the time,
the instruction the prediction is used for actually IS the instruction
for which the prediction was recorded, and even if the prediction is
wrong, the hardware will detect this and recover, losing performance,
but not correctness.

But you are right: reducing the number of conditional branches
does reduce the probability that a stack-extension tests's jump
will collide with some other jump in the table, screwing up
its prediction.


More information about the developers mailing list