[m-dev.] Mercury and GCC split stacks

Zoltan Somogyi zoltan.somogyi at runbox.com
Tue Sep 9 23:36:20 AEST 2014



On Tue, 9 Sep 2014 15:26:00 +1000, Paul Bone <paul at bone.id.au> wrote:
> That was one of my questions, is the small number of superpages a hardware
> restriction?  If it was software (the OS) I wondered why it was there at
> all.

It is a hardware restriction on current machines. I don't know of
any machines that have lifted it, but I haven't checked in a while.

> I knew that there was a shadow stack to make subroutine return faster, but I
> didn't realize it used the branch prediction stuff.  I've always been
> curious whether Mercury is slower because it doesn't use "return" and
> therefore this shadow stack is unused.

I have been curious about that too. Unfortunately, I don't see any way
of answering the question that would not require a huge amount of work.
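
The cheapest experiment I can think of is a toy microbenchmark that
isolates just the return-address-stack effect: the same chain of
"returns", done once with real call/ret instructions and once with
indirect jumps through an explicit stack of label addresses, which is
roughly how Mercury's label-based grades transfer control. The sketch
below (GNU C, since it needs computed gotos; all the names are made up)
would only be suggestive, since it measures the mechanism in isolation,
not Mercury's actual code.

#include <stdio.h>
#include <time.h>

#define DEPTH 64
#define ITERS 2000000

/* call/ret version: every return is predicted by the hardware
 * return-address stack.  Compile with -O0 or -O1; at higher levels
 * gcc may turn the recursion into a loop, which defeats the point. */
__attribute__((noinline))
static int chain(int n)
{
    if (n == 0)
        return 1;
    return chain(n - 1) + 1;
}

int main(void)
{
    clock_t t0, t1;
    volatile long sink = 0;
    int i;

    t0 = clock();
    for (i = 0; i < ITERS; i++)
        sink += chain(DEPTH);
    t1 = clock();
    printf("call/ret:      %.3fs\n", (double) (t1 - t0) / CLOCKS_PER_SEC);

    /* computed-goto version: each "return" is an indirect jump through
     * an explicit stack of label addresses, so the hardware
     * return-address stack is never consulted. */
    t0 = clock();
    for (i = 0; i < ITERS; i++) {
        void *retstack[DEPTH];
        int sp = 0, n = DEPTH, acc = 1;

    descend:
        if (n > 0) {
            retstack[sp++] = &&resume;
            n--;
            goto descend;
    resume:
            acc++;
        }
        if (sp > 0)
            goto *retstack[--sp];
        sink += acc;
    }
    t1 = clock();
    printf("computed goto: %.3fs\n", (double) (t1 - t0) / CLOCKS_PER_SEC);

    return (int) (sink & 1);
}

The interesting number would be the per-"return" difference between
the two loops, not either absolute time.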

> > the hardware will detect this and recover, losing performance,
> > but not correctness.
> 
> Yep, it has to cancel the in-flight work and try again.  I forget the word
> for this, pipeline flush or pipeline stall.

That is a pipeline flush. A stall is simply waiting for something you need
(a piece of data, a hardware resource) to become available. A flush
is MUCH more expensive than a stall. This is a major constraint
on the design of CPU microarchitectures. Almost all aspects of CPU
performance would improve if designers made pipelines deeper,
but this would worsen the branch misprediction penalty (the cost
of this pipeline flush) so much that, in many cases, it would take
away all the benefit of the deeper pipeline. The pipeline depths of
current machines are usually carefully designed to balance these
two effects. (The rest of the time, they use the pipeline structure
of the previous machine in the family, because they don't have
the time or the money to redesign it.)
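
To put a rough number on the cost of those flushes, here is a small
self-contained C program (my own toy example, not anything from this
thread) that times the same loop over data whose branch is either
almost perfectly predictable or essentially random. The extra time in
the random case is dominated by the flushes caused by the roughly 50%
misprediction rate.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)
#define PASSES 100

/* The loop whose conditional branch we are timing.  Compile with -O1:
 * at higher optimisation levels gcc may replace the branch with a
 * conditional move (or vectorise the loop), and the effect vanishes. */
static long long run(const unsigned char *data)
{
    long long sum = 0;
    int p, i;

    for (p = 0; p < PASSES; p++)
        for (i = 0; i < N; i++)
            if (data[i] >= 128)         /* the branch under test */
                sum += data[i];
    return sum;
}

int main(void)
{
    static unsigned char easy[N], hard[N];
    long long s1, s2;
    clock_t t0, t1, t2;
    int i;

    for (i = 0; i < N; i++) {
        /* one long run of not-taken, then one long run of taken:
         * predicted almost perfectly */
        easy[i] = (i < N / 2) ? 0 : 255;
        /* taken about half the time, in no pattern:
         * mispredicted roughly half the time */
        hard[i] = (unsigned char) (rand() & 0xff);
    }

    t0 = clock();
    s1 = run(easy);
    t1 = clock();
    s2 = run(hard);
    t2 = clock();

    printf("predictable branch:   %.2fs (sum %lld)\n",
        (double) (t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("unpredictable branch: %.2fs (sum %lld)\n",
        (double) (t2 - t1) / CLOCKS_PER_SEC, s2);
    return 0;
}

Both loops execute the same instructions the same number of times;
I would expect the second one to be several times slower on current
hardware, which is the misprediction penalty made visible.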

> x86 supports a prefix to conditional jump instructions that advises the CPU
> if the jump is likely to be taken.  Do you know if the CPU will ignore its
> branch prediction in this case; in particular, does it not update the
> branch table?

No, I don't know. I try to avoid looking at x86 instruction manuals,
since I would like to keep my sanity :-(
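
What I can say is how the hint is usually expressed from C: gcc's
__builtin_expect, which is what the Linux kernel's likely()/unlikely()
macros wrap. As far as I know, gcc uses it mainly to lay out the code
so that the expected path falls through; whether it also emits the x86
hint prefixes, and whether the hardware still pays any attention to
them, is exactly the part I can't answer. A minimal sketch (the
xmalloc wrapper is just an invented example):

#include <stdio.h>
#include <stdlib.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* An allocation wrapper whose failure path is marked as the cold,
 * unlikely one, so gcc can move it out of the hot path. */
static void *xmalloc(size_t n)
{
    void *p = malloc(n);

    if (unlikely(p == NULL)) {
        fprintf(stderr, "out of memory (%zu bytes)\n", n);
        exit(EXIT_FAILURE);
    }
    return p;
}

int main(void)
{
    char *buf = xmalloc(64);

    buf[0] = '\0';
    free(buf);
    return 0;
}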

> I remember you once telling me of someone trying this kind of optimisation
> (not on x86) though, and getting their conditions backwards and it didn't
> _slow_ the program down.  This called into question how valuable it would
> be.

That story was about FREQUENCY statements, which guided the optimization
of three-way if-statements in old dialects of FORTRAN. Due to a bug, the
compiler writers implemented it backwards (the generated code was
therefore pessimized, not optimized), but since nobody bothered to check
whether the "optimization" actually led to a speedup, nobody noticed for
a while. The story is from the 1960s.

Zoltan.


