[m-dev.] I want to move and rename the dependency_graph module
Paul Bone
paul at bone.id.au
Wed Feb 15 13:08:47 AEDT 2017
On Tue, Feb 14, 2017 at 04:58:05PM +1100, Zoltan Somogyi wrote:
>
> On Mon, 13 Feb 2017 17:10:20 +1100, Paul Bone <paul at bone.id.au> wrote:
> > I hadn't yet planned how to do it. It's fairer to say that I was planning
> > how to plan to do it ;-) There are a number of things that we (YesLogic has
> > had some discussions) think are worth trying.
>
> My original question was about what designs for ParamStruct you have explored,
> but it seems I have gone further on this than you have. A brain dump follows,
> interleaved with relevant parts of your message.
>
> All the following assumes model_det code; for model_semi, what I present below
> would need tweaks, while for model_non, the whole discussion is moot.
>
> > There are three approaches worth trying and their combinations.
> >
> > + Inlining
> > + We'll do it
> > + Let the C compiler do it.
>
> In terms of implementation, the "inlining" approach is completely
> independent of the others.
>
> Lets say an SCC contains n procedures, P1 through Pn.
> If the set of tail recursive calls in the SCC is {P1 -> P2, P2 -> P3, ...
> Pn-1 -> Pn, Pn -> P1}, i.e. each calls the next one and the last one
> calls the first, then inlining is clearly what we want to do. For each
> Pi that is called from above the SCC, we would inline the callee at every
> tail recursive call site except the one that calls Pi itself. This will give
> both the Mercury compiler and later the target language compiler the
> best chance to optimize the code. (Any recursive calls that are not TAIL
> recursive would be left alone.)
>
> If the number of entry points to the SCC is Ne, this will yield Ne copies
> of the code of every procedure in the SCC. However, I think we can handle that.
>
We have to include any recursive calls that are left alone in the number Ne
(for this purpose). So it could be higher.
> First, if Ne is 2 or 3 or even 4, the code size increase is probably a price
> most people would be willing to pay for reducing stack usage from linear
> in the depth of the tail recursion to constant. Second, if Ne is so high
> that the user would not want to pay the code size cost, it is trivial to
> add a limit: don't do this if Ne exceeds a configurable threshold.
> Third, in a nontrivial number cases Ne will in fact be just one. This is
> because programmers may split the code executed in each iteration of a loop
> into more than one piece when its length, complexity and/or its indentation
> level becomes too much.
I wanted to think about in what cases inlining non-tail calls would also
help.
a(...) :-
...,
b(...).
b(...) :-
...,
c(...),
....
c(...) :-
...,
a(...).
Inlining c into b in this situation won't help straight away, the call to a
in c (now in b) is still not in tail position.
Inlining a into c, and b into a, will help, it'll still use linear stack
space but not as much as before. But it requires that b is either the entry
point of the SCC, or that the path(s) through the SCC towards b are
duplicated.
Regarding code duplication. We would also want to consider inlining in
branches:
a(x, ...) :-
...,
b(...).
a(y, ...) :-
...,
b(...).
b(...) :-
...,
a(...).
Inlining b into a will create more code. This is also something we'd need
to factor into inlining decisions.
I would propose a cost model, except that we cannot hope to (analytically)
know the cost of code duplication vs the cost of non-tail calls. However I
think it would be useful to create and agree upon some kind of model. Even
if we can't compare code duplication and non-tail calls, we can calculate
code duplication.
> > I also want to
> > read the HLDS inclining code to get a sense for what can be done easily.
>
> Inlining.m is missing other good inlining heuristics as well;
> it has just the most basic ones. Don't take its existing structure
> as sancrosanct.
Yep, only insofar as what's easy to add without too much effort.i
> > We'll do it
> > -----------
> >
> > This is the optimisation that we discussed last year.
> > ...
> > The struct is passed by reference between all the members of the
> > SCC.
>
> The trampoline handles only TAIL recursive calls, so that if e.g.
> Pk is not called by a TAIL call in the SCC, then (a) ParamStruct will never
> need to handle the parameters of Pk, and likewise (b) fp will never point
> to Pk.
True. My bad, i wasn't being precise.
> Lets call the set of Pi that have TAIL calls to them in the SCC the TSCC.
> The rest of this email will restricts its attention to the TSCC.
>
> By definition, all the Pi in the TSCC have the same vector of output arguments
> in terms of type and meaning, although the names of the variables representing
> the same argument may differ between procedures. However, this is not true
> for their input arguments.
>
> Suppose the input arguments of Pi are PiI1 ... PiImi. The simplest design
> for ParamStruct is something that corresponds to this C type,
> where e.g. P1I1 stands for the type and the name of the first input arg
> of the first proc in the TSCC.
>
> struct {
> P1I1,
> P1I2,
> ...
> P1Im1,
>
> P2I1,
> P2I2,
> ...
> P2Im2,
>
> ...
>
> PnI1,
> PnI2,
> ...
> PnImn
> }
>
> (I am ignoring the outputs, but they would just be additional fields here.)
>
> This is what you proposed, and it should work in all of our current MLDS
> target languages.
I was thinking of something that is the super-set each PiI1..In set. Using
unions (your suggestion below) would be good when it's supported.
> The next simplest is this type:
>
> union {
> struct {
> P1I1,
> P1I2,
> ...
> P1Im1
> },
> struct {
> P2I1,
> P2I2,
> ...
> P2Im2
> },
> ...
> struct {
> PnI1,
> PnI2,
> ...
> PnImn
> }
> }
>
> This would have smaller stack usage when targeting C, but would not work
> for Java; I don't about C#. Also, it would need extension to the MLDS itself;
> it already has init_struct, but it doesn't have init_union. Also, I don't
> believe the MLDS has have real any support for structs on the stack, though
> it definitely does have support for structs on the heap and in read-only
> memory. This is because it always treats the variables it stores in stack
> frames individually, not as a group.
>
> When I thought about that, I realized that the trampoline loop does NOT
> have to look like this:
>
> > ParamStruct s;
> > ...
> > while (fp != NULL) {
> > fp = (Func*) (*fp)(&s);
> > }
>
> It can look like this:
>
> // PiIj for all i and j, NOT in a struct
> while (p_to_call != 0) {
> switch (p_to_call) {
> 1:
> // the code of P1, in which each tail call (to e.g. Pk)
> // is replaced with
> //
> // assignments to PkI1, ... PkImk;
> // an assignment of k to p_to_call, and
> // a continue
> p_to_call = 0;
> continue;
> 2:
> // Likewise for the other procedures in the TSCC
> ...
> }
> }
One thought about inlining the TSCC members, rather than calling them in
each switch arm. Is that if they are outlined we can let the C compiler
decide if inlining is worth while. It may have more information than we do
to make such decisions.
In either case I like this idea, after inlining, the C compiler has a lot of
freedom in how it places and accesses the variables that were the input
arguments between the calls.
> To achieve this, the MLDS backend would have to do code generation SCC by SCC.
> For each SCC, it would need to know what TSCCs, if any, exist inside it.
> It would do this by using the same algorithm to find SCCs as we already use,
> but this time, using as edges only the calls in the SCC that are both
> RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
> into one or more TSCCs. (While all the procedures in the SCC are by definition
> reachable from all other procedures in the SCC, they need not be reachable
> via TAIL calls from all other procedures in the SCC.)
>
> For TSCCs that contain only one procedure, we would translate that procedure
> the usual way. For TSCCs that contain two or more procedures, we would want
> to generate code that follows the scheme above.
Would an MLDS->MLDS transformation be easier?
> For this, we would want a generalized version of the existing ml_gen_proc
> in the compiler, which can be told that the procedure is part of a TSCC.
> For such procedures, ml_proc_gen would need to put something into the
> ml_gen_info that causes ml_call_gen to follow the template above for
> tail recursive calls. It would also need to generate target language
> variable names that have distinguishing prefix or suffix, so that
> they won't clash with the names of target variables generated for
> possibly-identically named variables in other procs in the TSCC.
>
> The code generated for a procedure this way would be the body of one of
> the switch arms above. We would need to generate the switch itself,
> and everything that goes with it, for each entry point in the TSCC.
> (Note that a procedure can be an entry point of a TSCC *without* being
> an entry point of the SCC that contains the TSCC.) If the TSCC has just
> one entry point, that is good. If it has more than one entry point,
> this would mean code duplication, and the only difference between the copies
> would be the code before the switch that initializes both p_to_call and
> the input arguments of the procedure to call. It should be possible
> to avoid this by generating a procedure whose argument list is
>
> - all of the PiIj, and
> - p_to_call.
>
> Then for each entry point, we could just call this TSCC procedure
> specifying the id of the procedure as p_to_call, the actual values
> of its input arguments, and dummy values for the input args of all
> the other procs in the TSCC.
>
> However, I am not sure about how easy it would be to generate dummy
> arguments that would be type correct in each target language.
Java and C# both have null which can be used for objects. I think we
understand the possible primitive types (like int) and can use values such
as 0.
Can a user give a foreign_type pragma with a primitive Java/C# type? I
don't think that would interact with Mercury's polymorphism well so it's
probably not supported.
> I also think that duplicating the switch once, and may be twice,
> would be preferable performance-wise to having to pass the dummy arguments
> in the first place.
Yes, it definitely gives the C/Java/C# compiler more freedom.
> The setting of p_to_call to a value, followed by a continue that
> goes immediately to a switch on p_to_call, is effectively a goto
> to the switch arm selected by the value assigned to p_to_call.
> I hope that most target language compilers, including gcc and clang,
> would recognize this fact, effectively yielding the code you proposed.
> However, letting that target language compiler do this would let us avoid
> including the concepts of labels and gotos in the MLDS. (In fact, I think
> that the absence of those concepts from the MLDS is one of the important
> things that differentiates the MLDS from the LLDS.)
We already have goto and labels in the MLDS. Using them rather than a
switch is probably okay. I don't know what the backend compilers will
optimize WRT switches. Using a goto to hreak out of the loop would also be
more direct, it allows us to use an infinite loop and avoid the branch
associated with the loop condition.
... Snip ...
> > Third, it seems that if things are "just right" compilers like GCC / Clang
> > can do the tail recursion themselves. In practice "just right" means that
> > the parameter lists and return values are the same.
>
> I think that approach would pretty fragile. We would want to give guarantees
> about what tail calls consume stack and what calls don't; I don't think we can
> subcontract such guarantees to the target language compiler.
>
> > We could either use the
> > struct from above, or simply make the parameter lists match by adding unused
> > arguments as necessary.
>
> The latter is unnecessarily slow.
>
> > This may share code with the above idea, simply
> > disabling the use of a trampoline when we know the C compiler matches one
> > that we know will do the mutual recursion.
>
> That test would be a bitch to keep up to date.
I don't think it would be that hard. However I prefer the above solution
anyway, I think it will (usually) lead to better code/performance.
If the above solution would create too much code duplication, particularly
when we can't get constant stack space due to non-tail recursive calls,
then I think we should fall back to either the trampoline, or letting the C
compiler do it.
> > If we're able to place all the procedures in a single C function with gotos
> > for tailcalls then that may be better than letting the C compiler implement
> > the tailcalls, we'd need to test.
>
> If the best that you can hope for from having the C compiler optimize tail calls
> is that the optimization transforms code that uses separate functions into
> code that replaces the tail calls with gotos in a merged function body,
> then generating separate functions cannot yield faster code; the only reason
> for preferring that approach would be easier coding on our part. I think
> the approach I proposed above should be easy enough to implement.
I agree, you're probably right that that is all the C compiler attempts to
do. Since most C code won't depend on LCO. If (big IF) it implemented
proper LCO then this would be different. But if it implemented proper LCO
then it wouldn't require the argument lists to match anyway.
> ------------------------------
>
> Another issue is that mutually recursive procedures often have one or more
> arguments that are passed to them from above the SCC and which are passed down
> the chain of recursive calls. In some cases, the argument is always passed
> unchanged; in some other cases, it is sometimes passed unchanged, and
> sometimes passed updated. In both cases, we should avoid the need for
> every tail recursive call to set the value of these parameters, because
> that is wasted work in the common case that the value is passed along
> unchanged.
>
> With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
> P3I2. The target language compiler may find out that it can store all these
> variables in the same stack slot, making assignments such as P3I2 = P2I1
> just before a tail call from P2 to P3 a no-operation. If some target language
> compilers don't, then the Mercury compiler could itself compute which
> the sets of input arguments, one argument from each TSCC, form a single
> parameters in this sense. We could then get ml_gen_proc to refer to that
> input argument by the shared name, and ml_call_gen to optimize its passing.
Implementing this in the MLDS as a seperate excess assignment elimination
wouldn't be too difficult.
> ------------------------------
>
> The above approach places completely different demands on the dependency
> graph module than your approach. First, it means that we DON'T need
> dependency graphs for the MLDS, so the splitting of the dependency_graph.m
> module isn't needed either. (Though moving it from transform_hlds to hlds
> is still a good idea.) Second, we would need *different* changes to
> the module: the ability to compute TSCCs, and the ability to compute
> entry points for both SCCs and TSCCs. Third, we would want to know
> the structure of the tail recursive calls inside an SCC: for example,
> are they linear?
I have a version of ml_dependency_graph.m in my workspace. It can be
modified to add support for TSCCs. My big question at this point is whether
we'd prefer to do this as an MLDS->MLDS transformation.
They have to be linear, there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
> This last point brings the discussion full circle, back to the start of
> the thread.
>
:-)
--
Paul Bone
http://paul.bone.id.au
More information about the developers
mailing list