[m-dev.] Mercury and GCC split stacks

Paul Bone paul at bone.id.au
Tue Sep 9 11:36:10 AEST 2014

On Mon, Sep 08, 2014 at 03:13:40PM +0200, Zoltan Somogyi wrote:
> On Mon, 8 Sep 2014 16:33:12 +1000, Paul Bone <paul at bone.id.au> wrote:
> > > You could do this, but only until you ran out of the small number
> > > of superpages you are allowed. This number is in the single, maybe double
> > > digits on current and foreseeable machines; the 10,000 threads
> > > you were talking about in an earlier email are right out.
> > > 
> > 
> > Oh, is that a hardware limitation?
> Yes, it is.
> > If superpages are rarely applicable because of a limitation like this, then
> > they're going to be rarely used to decrease TLB usage.
> They are rarely used, but I am pretty sure that the main reason
> is simply that most programmers not only don't know about them,
> they also don't know about the problem they are solving.
> Nobody looks for solutions to problems they don't realize they
> are having. Most profiling tools don't say anything about things like
> TLB misses.
> At the end of 2012, I helped some of Andrew Turpin's colleagues
> solve a problem with the performance of their code. They counted
> cycles and knew EXACTLY how long their code SHOULD have taken,
> but were baffled because it took significantly longer. When I told
> them that their problem was TLB thrashing and that using superpages
> would solve it, they started using them, and got the performance
> they originally expected.
> If even researchers at universities are baffled by this issue,
> what hope do ordinary programmers have? I know that most
> universities do NOT do a good job of teaching the relevant
> aspects of computer architecture. The University of Melbourne
> did teach those aspects, but doesn't teach all of them anymore.
> We used to have a 24 lecture subject on operating systems
> and another 24 lecture subject on computer architecture.
> Thanks to the Melbourne Model, we (or rather, they) now have
> a single 24 lecture subject teaching both.

I agree.  Even if a student attends and passes both 24-lecture subjects,
they may not retain all the information because it's not often put into
practice.  You and I have talked about TLBs before, and their
problems.  I'd also like to think that I'm good at optimisation and these
kinds of low-level details; however, I often forget that TLBs exist and don't
often realize when they have an effect.  This e-mail thread is an example.

> > Which means using a
> > protected page at the end of each stack segment, especially when that
> > protected page is in a different 4MB area, may not decrease the amount
> > that superpages are actually used.
> Decrease compared to what? You cannot reduce a natural number
> when you start at zero.

Compared to a setup where the other conditions for using superpages are met
and the feature is enabled.

> > I agree that using protected memory to check for stack overflow is not
> > universally a good idea.  I'm happy to agree that we'd need to test and
> > compare this with the current approach, and that the program and environment
> > will strongly contribute to which approach is best.
> Yes, that was my point.
> Stack-end tests using mmap:
> positive: do not need explicit tests at the starts of procedure bodies
> negative: cannot use superpages
> Stack-end tests using explicit code in procedure prologues:
> negative: the test code must be executed
> positive: can use superpages

I disagree that using mmap or mprotect outright prevents the use of
superpages.  AIUI it's still possible provided that we lay out the
memory carefully.
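
For reference, a minimal C sketch of the mmap/mprotect method under
discussion: one stack segment followed by a single inaccessible guard page.
The `segment` type and the `segment_alloc`/`segment_free` names are
hypothetical and are not taken from mercury_zones.[ch].  The sketch assumes a
POSIX system with MAP_ANONYMOUS and a stack that grows towards higher
addresses.  It shows only the guard-page mechanics and does not attempt any
superpage-aware layout.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict glibc modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for one stack segment (not a Mercury type). */
typedef struct {
    char   *base;    /* start of the usable area */
    size_t  usable;  /* usable bytes, a multiple of the page size */
    char   *guard;   /* start of the inaccessible guard page */
    size_t  page;    /* system page size in bytes */
} segment;

/* Round n up to a multiple of page, which must be a power of two. */
static size_t round_up(size_t n, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return (n + page - 1) & ~(page - 1);
}

/* Map a segment of at least min_bytes, then make the page just past the
 * usable area inaccessible.  Returns 0 on success, -1 on failure. */
static int segment_alloc(segment *s, size_t min_bytes)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0) return -1;

    size_t page = (size_t) ps;
    size_t usable = round_up(min_bytes ? min_bytes : 1, page);

    void *p = mmap(NULL, usable + page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    /* Any access to the guard page now raises SIGSEGV, so running off
     * the end of the segment is caught without an explicit test. */
    char *guard = (char *) p + usable;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(p, usable + page);
        return -1;
    }

    s->base = p;
    s->usable = usable;
    s->guard = guard;
    s->page = page;
    return 0;
}

/* Unmap the usable area and the guard page together. */
static int segment_free(segment *s)
{
    return munmap(s->base, s->usable + s->page);
}
```

A real runtime would also need a SIGSEGV handler that recognises faults in
the guard page; that part is omitted here.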

> We need measurements to see which has the greater net positive effect.
> However, the tradeoffs have changed a LOT since we originally decided
> to use mmap in 1993/1994. Then, very few computers were superscalar,
> and effectively none were out-of-order superscalar. They also had
> small memories compared to today, which made their TLBs relatively
> more effective (they covered a larger fraction of their main memories
> than today's TLBs do).
> On current computers, the cost of executing the explicit tests is actually
> reasonably likely to be zero, since (a) the conditional branch to the
> code that handles running out of stack is virtually never taken, so the
> hardware branch prediction mechanism can predict it with near-perfect
> accuracy, and (b) since this code does not compute anything that
> the procedure body itself needs, the procedure body code will never
> suspend waiting for it. Therefore the test code can be executed
> whenever an out-of-order superscalar CPU has some spare execution
> slots. Modern CPUs can issue and execute more instructions in parallel
> than virtually all programs can use, due to dependencies, so these
> spare slots are very, very likely to be available.

Yes.  That's a good point.  I believe that currently our stseg grades work
via an explicit check.  There is also some support in mercury_zones.[ch] for
the mmap/mprotect method, but I haven't tested it.
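
To make the explicit-check approach concrete, here is a minimal C sketch of a
prologue test that switches to a fresh segment when the current one is full.
All names (`seg`, `stack`, `push_frame` and so on) are hypothetical, and the
layout is an assumption for illustration only; the real stseg grades may work
quite differently.  Popping frames and returning to the previous segment are
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stack segment: a header followed by the usable words. */
typedef struct seg {
    struct seg *prev;     /* previously active segment, or NULL */
    size_t     *limit;    /* one past the last usable word */
    size_t      words[];  /* usable area (C99 flexible array member) */
} seg;

typedef struct {
    seg    *cur;          /* currently active segment */
    size_t *sp;           /* next free word in cur */
    size_t  seg_words;    /* default segment size in words */
    int     segments;     /* live segments, kept for demonstration */
} stack;

static seg *seg_new(seg *prev, size_t nwords)
{
    seg *s = malloc(sizeof *s + nwords * sizeof(size_t));
    if (s == NULL) return NULL;
    s->prev = prev;
    s->limit = s->words + nwords;
    return s;
}

static int stack_init(stack *st, size_t seg_words)
{
    assert(seg_words > 0);
    st->cur = seg_new(NULL, seg_words);
    if (st->cur == NULL) return -1;
    st->sp = st->cur->words;
    st->seg_words = seg_words;
    st->segments = 1;
    return 0;
}

/* Procedure prologue: reserve a frame of n words.  The comparison is the
 * explicit stack-end test; its branch is taken only on overflow. */
static size_t *push_frame(stack *st, size_t n)
{
    if (n > (size_t) (st->cur->limit - st->sp)) {
        /* Slow path: chain a new segment large enough for this frame. */
        size_t want = n > st->seg_words ? n : st->seg_words;
        seg *s = seg_new(st->cur, want);
        if (s == NULL) return NULL;
        st->cur = s;
        st->sp = s->words;
        st->segments++;
    }
    size_t *frame = st->sp;
    st->sp += n;
    return frame;
}

/* Free every segment in the chain. */
static void stack_destroy(stack *st)
{
    while (st->cur != NULL) {
        seg *prev = st->cur->prev;
        free(st->cur);
        st->cur = prev;
    }
    st->sp = NULL;
    st->segments = 0;
}
```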

> The main impact of test code on the execution of the rest of the
> program is likely to be via adding more conditional branches to
> the program, making the job of the branch predictor harder
> (by polluting its cache), thus causing it to be somewhat less effective,
> and increasing the workload of the circuitry keeping track of
> in-flight instructions, sometimes causing it to bump into limits
> where (without those tests) it wouldn't have bumped into them.

I wonder whether, rather than compiling this check into the start of each
procedure, we could move it to a single static code location.  Then the whole
program would have only one entry for it in the branch prediction tables.
This would reduce pressure on those tables, and would mean that the likely
branch outcome is already cached even for procedures that haven't yet been
called.
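
A rough C sketch of that idea, with the test kept in one out-of-line
routine that every procedure calls.  The names `check_stack`,
`handle_overflow`, `proc_a` and `proc_b` are hypothetical, the overflow
handler is a stand-in that merely raises the limit, and
`__attribute__((noinline))` is a GCC/Clang extension used here on the
assumption that one of those compilers is in use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stack state, measured in words. */
typedef struct {
    size_t sp;         /* words in use */
    size_t limit;      /* words available in the current segment */
    int    overflows;  /* times the slow path ran, for demonstration */
} stack;

/* Stand-in slow path: a real runtime would switch to a new segment. */
static void handle_overflow(stack *st, size_t n)
{
    st->overflows++;
    st->limit = st->sp + n + 4;
}

/* The only copy of the stack-end test in the program.  noinline keeps
 * the compiler from duplicating the branch into every caller. */
__attribute__((noinline))
static void check_stack(stack *st, size_t n)
{
    assert(st->sp <= st->limit);
    if (st->limit - st->sp < n) {
        handle_overflow(st, n);
    }
}

/* Two procedures whose prologues share the same conditional branch. */
static void proc_a(stack *st) { check_stack(st, 3); st->sp += 3; }
static void proc_b(stack *st) { check_stack(st, 2); st->sp += 2; }
```

The cost is an extra call and return in every prologue, so whether this is
a net win is something only measurement could show.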

Paul Bone
