[m-dev.] I want to move and rename the dependency_graph module

Zoltan Somogyi zoltan.somogyi at runbox.com
Tue Feb 14 16:58:05 AEDT 2017


On Mon, 13 Feb 2017 17:10:20 +1100, Paul Bone <paul at bone.id.au> wrote:
> I hadn't yet planned how to do it.  It's fairer to say that I was planning
> how to plan to do it ;-)  There are a number of things that we (YesLogic has
> had some discussions) think are worth trying.

My original question was about what designs for ParamStruct you have explored,
but it seems I have gone further on this than you have. A brain dump follows,
interleaved with relevant parts of your message.

All the following assumes model_det code; for model_semi, what I present below
would need tweaks, while for model_non, the whole discussion is moot.

> There are three approaches worth trying and their combinations.
> 
>  + Inlining
>  + We'll do it
>  + Let the C compiler do it.

In terms of implementation, the "inlining" approach is completely
independent of the others.

Lets say an SCC contains n procedures, P1 through Pn.
If the set of tail recursive calls in the SCC is {P1 -> P2, P2 -> P3, ...
Pn-1 -> Pn, Pn -> P1}, i.e. each calls the next one and the last one
calls the first, then inlining is clearly what we want to do. For each
Pi that is called from above the SCC, we would inline the callee at every
tail recursive call site except the one that calls Pi itself. This will give
both the Mercury compiler and later the target language compiler the
best chance to optimize the code. (Any recursive calls that are not TAIL
recursive would be left alone.)

If the number of entry points to the SCC is Ne, this will yield Ne copies
of the code of every procedure in the SCC. However, I think we can handle that.

First, if Ne is 2 or 3 or even 4, the code size increase is probably a price
most people would be willing to pay for reducing stack usage from linear
in the depth of the tail recursion to constant. Second, if Ne is so high
that the user would not want to pay the code size cost, it is trivial to
add a limit: don't do this if Ne exceeds a configurable threshold.
Third, in a nontrivial number cases Ne will in fact be just one. This is
because programmers may split the code executed in each iteration of a loop
into more than one piece when its length, complexity and/or its indentation
level becomes too much.

> I also want to
> read the HLDS inclining code to get a sense for what can be done easily.

Inlining.m is missing other good inlining heuristics as well;
it has just the most basic ones. Don't take its existing structure
as sancrosanct.

> We'll do it
> -----------
>
> This is the optimisation that we discussed last year.
> ...
>  The struct is passed by reference between all the members of the
> SCC.

The trampoline handles only TAIL recursive calls, so that if e.g.
Pk is not called by a TAIL call in the SCC, then (a) ParamStruct will never
need to handle the parameters of Pk, and likewise (b) fp will never point
to Pk.

Lets call the set of Pi that have TAIL calls to them in the SCC the TSCC.
The rest of this email will restricts its attention to the TSCC.

By definition, all the Pi in the TSCC have the same vector of output arguments
in terms of type and meaning, although the names of the variables representing
the same argument may differ between procedures. However, this is not true
for their input arguments.

Suppose the input arguments of Pi are PiI1 ... PiImi. The simplest design
for ParamStruct is something that corresponds to this C type,
where e.g. P1I1 stands for the type and the name of the first input arg
of the first proc in the TSCC.

    struct {
        P1I1,
        P1I2,
        ...
        P1Im1,

        P2I1,
        P2I2,
        ...
        P2Im2,

        ...

        PnI1,
        PnI2,
        ...
        PnImn
    }

(I am ignoring the outputs, but they would just be additional fields here.)

This is what you proposed, and it should work in all of our current MLDS
target languages.

The next simplest is this type:

    union {
        struct {
            P1I1,
            P1I2,
            ...
            P1Im1
        },
        struct {
            P2I1,
            P2I2,
            ...
            P2Im2
        },
        ...
        struct {
            PnI1,
            PnI2,
            ...
            PnImn
        }
    }

This would have smaller stack usage when targeting C, but would not work
for Java; I don't about C#. Also, it would need extension to the MLDS itself;
it already has init_struct, but it doesn't have init_union. Also, I don't
believe the MLDS has have real any support for structs on the stack, though
it definitely does have support for structs on the heap and in read-only
memory. This is because it always treats the variables it stores in stack
frames individually, not as a group.

When I thought about that, I realized that the trampoline loop does NOT
have to look like this:

>     ParamStruct s;
>     ...
>     while (fp != NULL) {
>         fp = (Func*) (*fp)(&s);
>     }

It can look like this:

    // PiIj for all i and j, NOT in a struct
    while (p_to_call != 0) {
        switch (p_to_call) {
            1:
                // the code of P1, in which each tail call (to e.g. Pk)
                // is replaced with
                //
                //     assignments to PkI1, ... PkImk;
                //     an assignment of k to p_to_call, and
                //     a continue
                p_to_call = 0;
                continue;
            2:
                // Likewise for the other procedures in the TSCC
            ...
        }
    }

To achieve this, the MLDS backend would have to do code generation SCC by SCC.
For each SCC, it would need to know what TSCCs, if any, exist inside it.
It would do this by using the same algorithm to find SCCs as we already use,
but this time, using as edges only the calls in the SCC that are both
RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
into one or more TSCCs. (While all the procedures in the SCC are by definition
reachable from all other procedures in the SCC, they need not be reachable
via TAIL calls from all other procedures in the SCC.)

For TSCCs that contain only one procedure, we would translate that procedure
the usual way. For TSCCs that contain two or more procedures, we would want
to generate code that follows the scheme above.

For this, we would want a generalized version of the existing ml_gen_proc
in the compiler, which can be told that the procedure is part of a TSCC.
For such procedures, ml_proc_gen would need to put something into the
ml_gen_info that causes ml_call_gen to follow the template above for
tail recursive calls. It would also need to generate target language
variable names that have distinguishing prefix or suffix, so that
they won't clash with the names of target variables generated for
possibly-identically named variables in other procs in the TSCC.

The code generated for a procedure this way would be the body of one of
the switch arms above. We would need to generate the switch itself,
and everything that goes with it, for each entry point in the TSCC.
(Note that a procedure can be an entry point of a TSCC *without* being
an entry point of the SCC that contains the TSCC.) If the TSCC has just
one entry point, that is good. If it has more than one entry point,
this would mean code duplication, and the only difference between the copies
would be the code before the switch that initializes both p_to_call and
the input arguments of the procedure to call. It should be possible
to avoid this by generating a procedure whose argument list is

- all of the PiIj, and
- p_to_call.

Then for each entry point, we could just call this TSCC procedure
specifying the id of the procedure as p_to_call, the actual values
of its input arguments, and dummy values for the input args of all
the other procs in the TSCC.

However, I am not sure about how easy it would be to generate dummy
arguments that would be type correct in each target language.
I also think that duplicating the switch once, and may be twice,
would be preferable performance-wise to having to pass the dummy arguments
in the first place.

The setting of p_to_call to a value, followed by a continue that
goes immediately to a switch on p_to_call, is effectively a goto
to the switch arm selected by the value assigned to p_to_call.
I hope that most target language compilers, including gcc and clang,
would recognize this fact, effectively yielding the code you proposed.
However, letting that target language compiler do this would let us avoid
including the concepts of labels and gotos in the MLDS. (In fact, I think
that the absence of those concepts from the MLDS is one of the important
things that differentiates the MLDS from the LLDS.)

> Outputs can also be retrieved from this struct.
> 
> I imagined using this type of trampoline and struct because it seems to be
> the most straight forward, therefore the most likely to succeed.  So far my
> testing was to establish that handling mutual recursion would provide some
> benefit, and not how to handle it best, so I didn't test any other methods. 
> 
> One thing that occurs to me now is to pass the arguments normally.  We would
> have to pass _all_ the arguments of the members to _each_ of the members.

As I mention above, I am not sure that would be easy.

> This would allow the C compiler to decide where to place them (such as in
> registers).  Return parameters would also have to be handled.  (Hrm, even in
> normal code we could optimize in/out pairs by letting them share a single
> C parameter.)

See below.

> Another possibility for the trampoline is to return a token (a member of an
> enum) and switch on it to decide which function to execute next.  We can
> pass a more precise parameter list in this case.  This gives the C compiler
> more control of parameter passing, at the cost of a little more indirection
> in the trampoline.
> 
> Finally all the members of the SCC could be placed in the same function
> body, and use goto statements to handle tail-calls.  This is likely to be
> very fast, but it only works for SCCs with a single entry point and no
> recursive non-tail calls that arn't also that entry point, at least without
> code duplication.

See above for how this should be doable at reasonable programming cost.

> Third, it seems that if things are "just right" compilers like GCC / Clang
> can do the tail recursion themselves.  In practice "just right" means that
> the parameter lists and return values are the same. 

I think that approach would pretty fragile. We would want to give guarantees
about what tail calls consume stack and what calls don't; I don't think we can
subcontract such guarantees to the target language compiler.

> We could either use the
> struct from above, or simply make the parameter lists match by adding unused
> arguments as necessary. 

The latter is unnecessarily slow.

> This may share code with the above idea, simply
> disabling the use of a trampoline when we know the C compiler matches one
> that we know will do the mutual recursion.

That test would be a bitch to keep up to date.

> If we're able to place all the procedures in a single C function with gotos
> for tailcalls then that may be better than letting the C compiler implement
> the tailcalls, we'd need to test.

If the best that you can hope for from having the C compiler optimize tail calls
is that the optimization transforms code that uses separate functions into
code that replaces the tail calls with gotos in a merged function body,
then generating separate functions cannot yield faster code; the only reason
for preferring that approach would be easier coding on our part. I think
the approach I proposed above should be easy enough to implement.

------------------------------

Another issue is that mutually recursive procedures often have one or more
arguments that are passed to them from above the SCC and which are passed down
the chain of recursive calls. In some cases, the argument is always passed
unchanged; in some other cases, it is sometimes passed unchanged, and
sometimes passed updated. In both cases, we should avoid the need for
every tail recursive call to set the value of these parameters, because
that is wasted work in the common case that the value is passed along
unchanged.

With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
P3I2. The target language compiler may find out that it can store all these
variables in the same stack slot, making assignments such as P3I2 = P2I1
just before a tail call from P2 to P3 a no-operation. If some target language
compilers don't, then the Mercury compiler could itself compute which
the sets of input arguments, one argument from each TSCC, form a single
parameters in this sense. We could then get ml_gen_proc to refer to that
input argument by the shared name, and ml_call_gen to optimize its passing.

------------------------------

The above approach places completely different demands on the dependency
graph module than your approach. First, it means that we DON'T need
dependency graphs for the MLDS, so the splitting of the dependency_graph.m
module isn't needed either. (Though moving it from transform_hlds to hlds
is still a good idea.) Second, we would need *different* changes to
the module: the ability to compute TSCCs, and the ability to compute
entry points for both SCCs and TSCCs. Third, we would want to know
the structure of the tail recursive calls inside an SCC: for example,
are they linear?

This last point brings the discussion full circle, back to the start of the thread.

Zoltan.




More information about the developers mailing list