[m-dev.] Mercury and GCC split stacks

Paul Bone paul at bone.id.au
Tue Sep 9 23:55:00 AEST 2014

On Tue, Sep 09, 2014 at 03:36:20PM +0200, Zoltan Somogyi wrote:
> On Tue, 9 Sep 2014 15:26:00 +1000, Paul Bone <paul at bone.id.au> wrote:
> > That was one of my questions, is the small number of superpages a hardware
> > restriction?  If it was software (the OS) I wondered why it was there at
> > all.
> It is a hardware restriction on current machines. I don't know of
> any machines that have lifted it, but I haven't checked in a while.
> > I knew that there was a shadow stack to make subroutine return faster, but I
> > didn't realize it used the branch prediction stuff.  I've always been
> > curious whether Mercury is slower because it doesn't use "return" and
> > therefore this shadow stack is unused.
> I have been curious about that too. Unfortunately, I don't see any way
> of answering the question that would not require a huge amount of work.

And that wouldn't involve some other trade-offs that might mask the effect
of the change.  Even microbenchmarks would be difficult to use because the
machine's caches and memory system won't be under the same stresses.

> > > the hardware will detect this and recover, losing performance,
> > > but not correctness.
> > 
> > Yep, it has to cancel the in-flight work and try again.  I forget the word
> > for this, pipeline flush or pipeline stall.
> That is a pipeline flush. A stall is simply waiting for something you need
> (a piece of data, a hardware resource) to become available. A flush
> is MUCH more expensive than a stall. This is a major constraint
> on the design of CPU microarchitectures. Almost all aspects of CPU
> performance would improve if designers made pipelines deeper,
> but this would worsen the branch misprediction penalty (the cost
> of this pipeline flush) so much that in many cases, it would take
> away all the benefit of the deeper pipeline. The pipeline depths of
> current machines are usually carefully designed to balance these
> two effects. (The rest of the time, they use the pipeline structure
> of the previous machine in the family, because they don't have
> the time or the money to redesign it.)

Yep, I remember the first Pentium 4s had a much deeper pipeline that made
the flushes more expensive.
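
As a rough illustration of the trade-off Zoltan describes above, here is a toy cost model in C. The function names, the parameters and every number in the note after the code are made up for illustration; it is a sketch under those assumptions, not a model of any real CPU.

```c
#include <assert.h>

/*
 * Toy model: average cycles per instruction (CPI) when some fraction of
 * instructions are branches, some fraction of those are mispredicted,
 * and each misprediction costs a pipeline flush of flush_cycles.
 */
double effective_cpi(double base_cpi, double branch_frac,
                     double mispredict_rate, double flush_cycles)
{
    assert(base_cpi > 0.0 && flush_cycles >= 0.0);
    return base_cpi + branch_frac * mispredict_rate * flush_cycles;
}

/* Average time per instruction, given the clock period in nanoseconds. */
double ns_per_instruction(double cpi, double cycle_ns)
{
    assert(cycle_ns > 0.0);
    return cpi * cycle_ns;
}
```

Assume 20% of instructions are branches, a shallow pipeline with a 1.0ns cycle and a 10-cycle flush, and a deeper pipeline with a 0.8ns cycle and a 20-cycle flush. If 5% of branches are mispredicted, the deeper pipeline wins (0.96ns versus 1.10ns per instruction). If 25% are mispredicted, it loses (1.60ns versus 1.50ns), which is the "takes away all the benefit" case.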

> > x86 supports a prefix to conditional jump instructions that advises the CPU
> > whether the jump is likely to be taken.  Do you know if the CPU will ignore its
> > branch prediction in this case, in particular does it not update the
> > branch table?
> No, I don't know. I try to avoid looking at x86 instruction manuals,
> since I would like to keep my sanity :-(

Ah, time permitting I might try this.
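
For reference, the compiler-level analogue of such a hint is easy to sketch. GCC and Clang provide `__builtin_expect`; the `LIKELY`/`UNLIKELY` macro names and the `sum_nonneg` function below are hypothetical. Whether the compiler turns the hint into an instruction prefix or only changes the code layout is not something this sketch assumes, so it does not answer the question about the branch table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper macros; only __builtin_expect itself is a real builtin. */
#if defined(__GNUC__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Sum the non-negative entries; negative entries are assumed to be rare. */
long sum_nonneg(const long *xs, size_t n)
{
    long total = 0;

    assert(xs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(xs[i] < 0)) {
            continue;           /* hinted as the rare path */
        }
        total += xs[i];
    }
    return total;
}
```

Getting the hint backwards (writing `LIKELY` here) would not change the result, only possibly the speed, so a mistake of that kind can only be caught by measuring.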

> > I remember you once telling me of someone trying this kind of optimisation
> > (not on x86) though, and getting their conditions backwards and it didn't
> > _slow_ the program down.  This called into question how valuable it would
> > be.
> That story was about FREQUENCY statements, which guided the optimization
> of three-way if-statements in old dialects of FORTRAN. Due to a bug, they
> implemented it backwards (the generated code was therefore pessimized,
> not optimized), but since nobody bothered to check whether the
> "optimization" actually led to a speedup, nobody noticed for a while.
> That story was from the 1960s.

Ah, so it did have an effect, but no-one measured it or otherwise noticed.

Paul Bone
