[m-dev.] Mercury and GCC split stacks

Paul Bone paul at bone.id.au
Tue Sep 9 15:26:00 AEST 2014


On Tue, Sep 09, 2014 at 05:10:59AM +0200, Zoltan Somogyi wrote:
> 
> 
> On Tue, 9 Sep 2014 11:36:10 +1000, Paul Bone <paul at bone.id.au> wrote:
> > I disagree that using mmap or mprotect outright prevents the use of
> > superpages.  AIUI it's still be possible provided that we lay out the
> > memory carefully.
> 
> How? The context of our discussion is 10,000 stacks, so with redzones,
> you need 10,000 changeovers from "access permitted" to "access denied".
> The maximum number of superpages supported is typically something like 4,
> and on x86s and most other ISAs there can be no permission changes
> inside a page, so this would get you only 4 such changeovers, however
> carefully you laid out the stacks.

That was one of my questions, is the small number of superpages a hardware
restriction?  If it was software (the OS) I wondered why it was there at
all.

> 
> You could use superpages for 4 stacks out of 10,000, I guess,
> but I wouldn't consider such a system "elegant". How would you
> decide which stacks get the superpages?
> 
> (CAVEAT: there ARE some designs for virtual memory systems
> that allow different parts of a page to have different permissions.
> However, most working programmers have never seen one in real life,
> and the whole point of superpages is to give the TLB information
> in a format that is a MORE compact description of the address space,
> not LESS.)

Yep.

> > I believe that currently our stseg grades work
> > via an explicit check.
> 
> They do.
> 
> > I wonder if rather than compiling this check into the start of each
> > procedure if we move it to a static code location, then for the whole
> > program there is only one entry in the branch prediction tables.
> 
> But that would add other overheads, including the jump to this code,
> and jump back. The former would be an absolute jump (at least without
> -fpic) and not need a branch prediction, but the latter would need prediction;
> not about whether the branch is taken (it will be), but about where
> to branch back to. Such returns can be handled by the CPU branch
> prediction unit, which typically has its own fixed-size miniature stack of
> subroutine return addresses. Your proposal would effectively take away
> one entry in this circuit from user level code.

Ah, I thought that that would just be handled by the pipelining stuff.  Once
the destination address was available the fetch/decode stuff would continue
reading the instruction stream.

> See, after taking away low-hanging fruit, overhead is often like
> toothpaste; you can squeeze it out here, but that only moves to there.
> 
> Since the asm_fast grades do not use call/return instructions
> (they should all get compiled into plain branches, the kind intended
> for use WITHIN subroutines), this may not be a big deal for us.

I knew that there was a shadow stack to make subroutine return faster, but I
didn't realize it used the branch prediction stuff.  I've always been
curious whether Mercury is slower because it doesn't use "return" and
therefore this shadow stack is unused.

> > This
> > reduces pressure on these tables and means that for procedures that haven't
> > yet been called the probable branch is already cached.
> 
> Actually, except at program startup, ALL branches are effectively cached,
> even if you haven't executed them yet.
> 
> Confused? This happens because branch prediction caches do not
> actually IDENTIFY the jump whose outcome they are predicting.
> All they say is something like "a jump whose address has the last N bits
> 1001..1100 behaved this way the last time it was executed, so if you
> see a  jump with the same last N bits, predict it to do the same".
> (CAVEAT: I am STILL simplifying some things.) Since the table typically
> contains something like 2k to 16k entries (last I checked), it gets filled
> REALLY quickly. The reason why it works is twofold: most of the time,
> the instruction the prediction is used for actually IS the instruction
> for which the prediction was recorded, and even if the prediction is
> wrong, the hardware will detect this and recover, losing performance,
> but not correctness.

Yep, it has to cancel the in-flight work and try again.  I forget the word
for this, pipeline flush or pipeline stall.

> But you are right: reducing the number of conditional branches
> does reduce the probability that a stack-extension tests's jump
> will collide with some other jump in the table, screwing up
> its prediction.

x86 supports a prefix to conditional jump instructions that advises the CPU
if the jump is likely to be taken.  Do you know if the CPU will ignore it's
branch prediction in this case, in particular does it not update the
branch table.

I remember you once telling me of someone trying this kind of optimisation
(not on x86) though, and getting their conditions backwards and it didn't
_slow_ the program down.  This called into question how valuable it would
be.

Thanks.


-- 
Paul Bone



More information about the developers mailing list