[m-dev.] Mercury and GCC split stacks

Zoltan Somogyi zoltan.somogyi at runbox.com
Mon Sep 8 23:13:40 AEST 2014



On Mon, 8 Sep 2014 16:33:12 +1000, Paul Bone <paul at bone.id.au> wrote:
> > You could do this, but only until you ran out of the small number
> > of superpages you are allowed. This number is in the single, maybe double
> > digits on current and foreseeable machines; the 10,000 threads
> > you were talking about in an earlier email are right out.
> > 
> 
> Oh, is that a hardware limitation?

Yes, it is.

> If superpages are rarely applicable because of a limitation like this, then
> they're going to be rarely used to decrease TLB usage.

They are rarely used, but I am pretty sure that the main reason
is simply that most programmers not only don't know about them,
but also don't know about the problem they solve.
Nobody looks for solutions to a problem they don't realize they
have. Most profiling tools don't say anything about things like
TLB misses.
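
For concreteness, here is a minimal sketch of how a program can ask
for superpages on Linux, where they are called huge pages. It assumes
an x86-64 system on which some 2 MB huge pages have been reserved
beforehand (e.g. via /proc/sys/vm/nr_hugepages); the mmap flags are
Linux-specific:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;   /* one 2 MB huge page */
        /*
        ** MAP_HUGETLB asks the kernel to back this mapping with huge
        ** pages, so a single TLB entry covers the whole 2 MB.
        */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            /* This fails if no huge pages have been reserved. */
            perror("mmap");
            exit(EXIT_FAILURE);
        }
        /* ... use the memory ... */
        munmap(p, len);
        return 0;
    }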

At the end of 2012, I helped some of Andrew Turpin's colleagues
solve a problem with the performance of their code. They had counted
cycles and knew EXACTLY how long their code SHOULD have taken,
but were baffled because it took significantly longer. When I told
them that their problem was TLB thrashing and that using superpages
would solve it, they started using them, and got the performance
they had originally expected.

If even researchers at universities are baffled by this issue,
what hope do ordinary programmers have? I know that most
universities do NOT do a good job of teaching the relevant
aspects of computer architecture. The University of Melbourne
did teach those aspects, but doesn't teach all of them anymore.
We used to have a 24-lecture subject on operating systems
and another 24-lecture subject on computer architecture.
Thanks to the Melbourne Model, we (or rather, they) now have
a single 24-lecture subject teaching both.

> Which means using a
> protected page at the end of each stack segment, especially when that
> protected page is in a different 4MB area, may not decrease the amount
> that superpages are actually used.

Decrease compared to what? You cannot reduce a natural number
when you start at zero.

> I agree that using protected memory to check for stack overflow is not
> universally a good idea.  I'm happy to agree that we'd need to test and
> compare this with the current approach, and that the program and environment
> will strongly contribute to which approach is best.

Yes, that was my point.

Stack-end tests using mmap:
positive: do not need explicit tests at the starts of procedure bodies
negative: cannot use superpages
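
A sketch of the first approach, with illustrative names and sizes
(this is NOT the Mercury runtime's actual code): allocate each stack
segment with one extra page at its end and revoke all access to that
page, so that running off the end of the segment raises SIGSEGV,
which a signal handler can then catch.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    #define SEGMENT_SIZE    (64 * 1024)     /* illustrative */
    #define PAGE_SIZE       4096            /* illustrative */

    void *alloc_stack_segment(void)
    {
        char *base = mmap(NULL, SEGMENT_SIZE + PAGE_SIZE,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) {
            return NULL;
        }
        /*
        ** Make the last page inaccessible; any access to it
        ** (i.e. a stack overflow) raises SIGSEGV.
        */
        if (mprotect(base + SEGMENT_SIZE, PAGE_SIZE, PROT_NONE) != 0) {
            munmap(base, SEGMENT_SIZE + PAGE_SIZE);
            return NULL;
        }
        return base;
    }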

Stack-end tests using explicit code in procedure prologues:
negative: the test code must be executed
positive: can use superpages
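
And a sketch of the second approach; again, all the names here
(sp, stack_limit, and so on) are just illustrative:

    extern char *sp;            /* current stack pointer */
    extern char *stack_limit;   /* first address past the segment */
    extern void handle_stack_overflow(void);

    void some_procedure(void)
    {
        enum { FRAME_SIZE = 128 };          /* illustrative */
        /* The explicit stack-end test in the prologue. */
        if (sp + FRAME_SIZE > stack_limit) {
            handle_stack_overflow();        /* virtually never taken */
        }
        /* ... the procedure body proper ... */
    }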

We need measurements to see which has the greater net positive effect.
However, the tradeoffs have changed a LOT since we originally decided
to use mmap in 1993/1994. Then, very few computers were superscalar,
and effectively none were out-of-order superscalar. They also had
small memories compared to today, which made their TLBs relatively
more effective (they covered a larger fraction of their main memories
than today's TLBs do).

On current computers, the cost of executing the explicit tests is
actually reasonably likely to be zero, because (a) the conditional
branch to the code that handles running out of stack is virtually
never taken, so the hardware branch prediction mechanism can predict
it with near-perfect accuracy, and (b) the test code does not compute
anything that the procedure body itself needs, so the procedure body
will never stall waiting for it. The test code can therefore be
executed whenever an out-of-order superscalar CPU has some spare
execution slots. Modern CPUs can issue and execute more instructions
in parallel than virtually all programs can use, due to dependencies,
so these spare slots are very, very likely to be available.
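
With gcc, the test can also tell the compiler which way the branch
is expected to go, so that the overflow handling code is laid out off
the hot path; a sketch, reusing the illustrative names from the
earlier one:

    extern char *sp, *stack_limit;      /* as in the earlier sketch */
    extern void handle_stack_overflow(void);

    void some_procedure_hinted(void)
    {
        enum { FRAME_SIZE = 128 };
        /*
        ** __builtin_expect tells gcc the condition is almost always
        ** false, so it moves the overflow call off the hot path.
        */
        if (__builtin_expect(sp + FRAME_SIZE > stack_limit, 0)) {
            handle_stack_overflow();
        }
        /* ... the procedure body proper ... */
    }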

The main impact of the test code on the execution of the rest of the
program is likely to be indirect: the extra conditional branches make
the branch predictor's job harder (by polluting its cache), making it
somewhat less effective, and they increase the workload of the
circuitry that keeps track of in-flight instructions, sometimes
pushing it into limits that (without those tests) it would not have
hit.

Zoltan