[m-dev.] I want to move and rename the dependency_graph module

Zoltan Somogyi zoltan.somogyi at runbox.com
Wed Feb 15 14:15:31 AEDT 2017



On Wed, 15 Feb 2017 13:08:47 +1100, Paul Bone <paul at bone.id.au> wrote:
> > Lets say an SCC contains n procedures, P1 through Pn.
> > If the set of tail recursive calls in the SCC is {P1 -> P2, P2 -> P3, ...
> > Pn-1 -> Pn, Pn -> P1}, i.e. each calls the next one and the last one
> > calls the first, then inlining is clearly what we want to do. For each
> > Pi that is called from above the SCC, we would inline the callee at every
> > tail recursive call site except the one that calls Pi itself. This will give
> > both the Mercury compiler and later the target language compiler the
> > best chance to optimize the code. (Any recursive calls that are not TAIL
> > recursive would be left alone.)
> > 
> > If the number of entry points to the SCC is Ne, this will yield Ne copies
> > of the code of every procedure in the SCC. However, I think we can handle that.
> > 
> 
> We have to include any recursive calls that are left alone in the number Ne
> (for this purpose).  So it could be higher.

Yes, you are right. If any member of the SCC is called from the SCC itself
in a non-tail position, that member would need an entry point, even if it is
not called from anywhere in any higher SCC.

> I wanted to think about in what cases inlining non-tail calls would also
> help.

When the callee is a recursive predicate, inlining it is rarely a good idea,
outside of special circumstances such as the discussed above.

> Inlining b into a will create more code.  This is also something we'd need
> to factor into inlining decisions.

The decisions you want inlining to make is not "do I inline B into A?", but
rather "do I inline B into *this call site* in A?". It is perfectly possible
to inline B into some call site(s) in A but not into others.

> I would propose a cost model, except that we cannot hope to (analytically)
> know the cost of code duplication vs the cost of non-tail calls.  However I
> think it would be useful to create and agree upon some kind of model.  Even
> if we can't compare code duplication and non-tail calls, we can calculate
> code duplication.

Actually, that is effectively impossible to do accurately in inlining.m by itself.
The reason is what you *want* to measure is NOT the immediate increase
in code size generated by inlining, but the *final*, overall increase in code size,
and *that* is affected by every optimization that runs after inlining.
That measure is affected by how the particular details of a given call site
affect the optimizations that can be applied to inlined copy of the callee.
For example, if the callee has a switch on X, and a call site sets the variable
that becomes the input arg X to a given function symbol, later optimizations
can replace the whole switch with just the selected arm. This means that
in some cases, even inlining a procedure with a huge body can lead to an
overall *reduction* in code size, if later optimizations can delete everything
in the callee's body except e.g. a switch arm containing a single unification,
whose code is smaller than the code required to make the call.

Therefore *accurate* computation of the code duplication you end up with
(as opposed to what you *start* with) is possible only if inlining runs
all the later passes of the compiler on the output of each inlining trial.

> I was thinking of something that is the super-set each PiI1..In set. 

The names we generate for MLDS variables include the variable number
of the HLDS variable they represent. Since these will coincide between
procedures only by accident, this is essentially equivalent. However,
just because there is an MLDS variable named ABC_42 in both p and q
does NOT mean that they are bound to even have the same type,
much less any kind of data flow relationship.

> One thought about inlining the TSCC members, rather than calling them in
> each switch arm.  Is that if they are outlined we can let the C compiler
> decide if inlining is worth while.  It may have more information than we do
> to make such decisions.

It *may* have more information than we do about whether inlining is
worthwhile, though I doubt it. I am certain, though, that *we* have
more information about whether inlining is *safe*. Since the target language
compilers will always need to be conservative about safety, I don't think
leaving the inlining up to them is a good idea.

> In either case I like this idea, after inlining, the C compiler has a lot of
> freedom in how it places and accesses the variables that were the input
> arguments between the calls.

Yes.
 
> > To achieve this, the MLDS backend would have to do code generation SCC by SCC.
> > For each SCC, it would need to know what TSCCs, if any, exist inside it.
> > It would do this by using the same algorithm to find SCCs as we already use,
> > but this time, using as edges only the calls in the SCC that are both
> > RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
> > into one or more TSCCs. (While all the procedures in the SCC are by definition
> > reachable from all other procedures in the SCC, they need not be reachable
> > via TAIL calls from all other procedures in the SCC.)
> > 
> > For TSCCs that contain only one procedure, we would translate that procedure
> > the usual way. For TSCCs that contain two or more procedures, we would want
> > to generate code that follows the scheme above.
> 
> Would an MLDS->MLDS transformation be easier?

No.

If you have a choice between generating the wrong thing and then fixing it up
or generating the right thing directly, always choose the latter. The code that
does the fixups in the first approach would always be vulnerable to changes
in the code generation itself screwing up the patterns that it looks for;
this could happen even if those changes were optimizations for unrelated
purposes.

I learned this lesson the hard way.

> > > This may share code with the above idea, simply
> > > disabling the use of a trampoline when we know the C compiler matches one
> > > that we know will do the mutual recursion.
> > 
> > That test would be a bitch to keep up to date.
> 
> I don't think it would be that hard.  However I prefer the above solution
> anyway, I think it will (usually) lead to better code/performance.

I wasn't thinking about it being hard; I was thinking that it would be *annoying*,
given that it would have to be repeated after every new version of gcc/clang/etc.
 
> > Another issue is that mutually recursive procedures often have one or more
> > arguments that are passed to them from above the SCC and which are passed down
> > the chain of recursive calls. In some cases, the argument is always passed
> > unchanged; in some other cases, it is sometimes passed unchanged, and
> > sometimes passed updated. In both cases, we should avoid the need for
> > every tail recursive call to set the value of these parameters, because
> > that is wasted work in the common case that the value is passed along
> > unchanged.
> > 
> > With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
> > P3I2. The target language compiler may find out that it can store all these
> > variables in the same stack slot, making assignments such as P3I2 = P2I1
> > just before a tail call from P2 to P3 a no-operation. If some target language
> > compilers don't, then the Mercury compiler could itself compute which
> > the sets of input arguments, one argument from each TSCC, form a single
> > parameters in this sense. We could then get ml_gen_proc to refer to that
> > input argument by the shared name, and ml_call_gen to optimize its passing.
> 
> Implementing this in the MLDS as a seperate excess assignment elimination
> wouldn't be too difficult.

The existing excess assignment elimination pass works only on conjunctions;
if it sees that one conjunct is X1 := X0, then it can replace X0 with X1,
or vice versa, in all the later conjuncts. This is relatively simple to do.
In the case we are talking about, the assignment Xq := Xp will appear just before
the tailcall from p to q, implemented as a branch back to the top of the driver
loop, so this won't work. You will have to extend the analysis to q's branch in
the switch, at which point the analysis needed is so different that you may as well
call it by some other name.

When a procedure body updates STATE_VARIABLE_X_0 to become
STATE_VARIABLE_X_1 and then never refers to STATE_VARIABLE_X_0 again,
I trust the target language compilers to reuse the same stack slot for both,
because keeping track of such things is easy for them in code without backward
branches. Tail calls introduce backward branches, making the job much harder,
not just for target language compilers, but for any MLDS analyses as well.

By contrast, things are much simpler at the HLDS level: you just need to match
the input args in the procedure head with those in the input arg list of the
recursive call. If e.g. input arg 2 in the head and input arg 3 in the tail call
have the same type and the same name modulo a numeric suffix, then we
can use the same MLDS var for input arg 2 of this proc and input arg 3 of the
tail call's callee proc.

Obviously, this test can be made more sophisticated, and independent of
variable names, but this simple test will get up a large fraction, probably
well over half, of the improvement that is potentially available from
this kind of optimization.

> I have a version of ml_dependency_graph.m in my workspace.  It can be
> modified to add support for TSCCs.  My big question at this point is whether
> we'd prefer to do this as an MLDS->MLDS transformation.

I don't think so. As far as I can see, implementing tail calls in the MLDS
is best done at the HLDS to MLDS stage, not as an MLDS to MLDS transformation.
 
> They have to be linear, there can only be one tail call on any branch.
> Other SCC calls make it non-linear, but they're not TSCC calls.  Maybe I
> misunderstood what you mean by linear.

I meant that each procedure in the SCC contains just one tail call,
to some member of the SCC other than itself. For example,
if an SCC contains p, q, r and s, then the only tail calls are p->q,
q->r, r->s and s->p. If you add a tail call p->r, then it is not linear.
This kind of linearity test is for the inlining we discussed at the top.

Zoltan.




More information about the developers mailing list