[m-rev.] for review: optimize field updates of packed args
Zoltan Somogyi
zoltan.somogyi at runbox.com
Mon Sep 10 13:37:50 AEST 2018
On Mon, 10 Sep 2018 12:57:11 +1000, Peter Wang <novalazy at gmail.com> wrote:
For your first question, here is my proposed answer:
@@ -1348,6 +1348,33 @@
+ find_best_matching_instance_loop(TailInstances, FilledPackedWord,
+ HeadInstance, HeadMatches, HeadNonMatches, HeadMissing,
+ BestInstance, BestMatches, _BestNonMatches, BestMissing),
++ % As a heuristic, we want to reuse an existing word only if at least two
++ % of its bitfields already contain the right value. If no bitfield
++ % contains the right value, then there is nothing to reuse. And reuse
++ % may be a net loss even if one field contains the right value, because
++ % the reuse approach requires that the bits of the nonreused bitfields
++ % be masked off *before* we can OR in their updated values, forcing
++ % sequential execution of instructions, whereas if we do not reuse
++ % the existing word, then there is no forced sequentiality: the values
++ % of the various bitfields can be ORed with each other in any order
++ % the C compiler finds best. It is true that the nonreuse approach
++ % requires the value of the unchanged bitfield to be available.
++ % Computing the value of that bitfield is a possible extra cost,
++ % but (a) if the rest of the code needs the value of the variable
++ % in that bitfield, we have to incur that cost anyway, and (b) even if
++ % that is not the case, the mask operation doing that computation
++ % can be done in parallel with any mask operations computing the values
++ % of the other bitfields in the same word. The biggest cost of the
++ % nonreusing approach is actually storing the value of the unchanged
++ % bitfield separately from the word containing it. Storing the value
++ % of one such bitfield may cost roughly as much as storing the whole word
++ % or the pointer to it (if its value is not reused while in a register),
++ % but storing the value of *two* such bitfields will almost certainly
++ % have a higher cost.
++ %
++ % However, modern CPUs have such complicated performance characteristics
++ % that the above analysis may be wrong :-(, so exploring other heuristics
++ % may be worthwhile.
+ BestMatches >= 2,
(You are right; that badly needed documenting.)
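
To make the tradeoff in that comment concrete, here is a minimal C sketch
of the two shapes of update the heuristic chooses between. It is purely
illustrative: the field layout, the constants and the function names are
invented, and this is not the code the compiler actually emits.

#include <stdint.h>

/* Hypothetical layout: a 32-bit word packs four 8-bit bitfields,
** a (bits 0-7), b (bits 8-15), c (bits 16-23) and d (bits 24-31).
** We update c and d while a and b keep their values, so two bitfields
** already contain the right value and the BestMatches >= 2 test passes.
*/

/* Reuse the existing word: mask off the bits of c and d, then OR in
** their new values, shifted into position. Both ORs depend on the
** result of the AND, which forces the instructions into a sequential
** chain.
*/
static uint32_t
update_c_d_reuse(uint32_t old_word, uint32_t new_c, uint32_t new_d)
{
    return (old_word & 0x0000ffffU) | (new_c << 16) | (new_d << 24);
}

/* Do not reuse the existing word: recover the values of the unchanged
** bitfields and rebuild the word from scratch. The four OR operands are
** independent, so the C compiler and the CPU may evaluate them in any
** order, or in parallel; the cost is that the values of a and b must be
** available (and possibly stored) separately from the word holding them.
*/
static uint32_t
update_c_d_rebuild(uint32_t old_word, uint32_t new_c, uint32_t new_d)
{
    uint32_t a = old_word & 0xffU;          /* unchanged bitfield */
    uint32_t b = (old_word >> 8) & 0xffU;   /* unchanged bitfield */
    return a | (b << 8) | (new_c << 16) | (new_d << 24);
}

Both versions compile to a handful of bitwise instructions; the difference
the comment is concerned with is the length of the dependency chain and the
extra stores, not the instruction count.
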
> > +:- pred count_matching_bitfields_loop(
> > + list(filled_bitfield)::in, list(filled_bitfield)::in,
> > + int::in, int::out, int::in, int::out,
> > + list(filled_bitfield)::in, list(filled_bitfield)::out) is det.
> > +
> > +count_matching_bitfields_loop([], [],
> > + !Matches, !NonMatches, !RevMissingB).
> > +count_matching_bitfields_loop([], [_ | _],
> > + !Matches, !NonMatches, !RevMissingB) :-
> > + unexpected($pred, "length mismatch").
> > +count_matching_bitfields_loop([_ | _], [],
> > + !Matches, !NonMatches, !RevMissingB) :-
> > + unexpected($pred, "length mismatch").
> > +count_matching_bitfields_loop(
> > + [FilledBitfieldA | FilledBitfieldsA],
> > + [FilledBitfieldB | FilledBitfieldsB],
> > + !Matches, !NonMatches, !RevMissingB) :-
> > + count_matching_bitfield(FilledBitfieldA, FilledBitfieldB,
> > + !Matches, !NonMatches, !RevMissingB),
> > + count_matching_bitfields_loop(FilledBitfieldsA, FilledBitfieldsB,
> > + !Matches, !NonMatches, !RevMissingB).
>
> list.foldl3_corresponding
That would be shorter, and if I had remembered its existence,
I might have used it instead of writing the above code.
However, this is likely to be a reasonably heavily used part
of the code generator, so I don't think there is any point
in replacing already existing code with a slower alternative.
> > +ml_gen_deconstruct_tagword_args(LHSTagwordLval, CastTagwordRval,
> > + TagwordType, TagFilledBitfield, RHSVarRepns, ArgModes,
> > + Context, Defns, Stmts, !Info) :-
> > + ml_gen_deconstruct_tagword_args_loop(!.Info, CastTagwordRval,
> > + RHSVarRepns, ArgModes, Context, [], ToOrRvals, 0u, ToOrMask,
> > + [], RevArgFilledBitfields,
> > + all_partials_assign_right, AllPartialsRight, RightStmts),
> > + (
> > + ToOrRvals = [],
> > (
> > + AllPartialsRight = not_all_partials_assign_right,
> > + Defns = [],
> > Stmts = RightStmts
>
> Can you explain why this branch differs from the one below?
This is my proposed explanation:
- ToOrRvals = [],
+ AllPartialsRight = not_all_partials_assign_right,
++ % If the unifications of some arguments assign to the left,
++ % but the values being assigned to the left are zero, then
++ % we do not include them in ToOrRvals.
+ Defns = [],
By the way, do you think the code would be clearer if the outer
switch were on AllPartialsRight, and the inner switch were on ToOrRvals?
That would work as well, given that there are only three valid combinations
of their values. I cannot decide, being too close to the code.
> > ;
> > + AllPartialsRight = all_partials_assign_right,
> > + list.reverse(RevArgFilledBitfields, ArgFilledBitfields),
> > + FilledBitfields = [TagFilledBitfield | ArgFilledBitfields],
> > + record_packed_word(FilledBitfields, CastTagwordRval, Context,
> > + Defns, WordVarStmts, !Info),
> > + Stmts = WordVarStmts ++ RightStmts
> > + )
> > + ;
> > + ToOrRvals = [HeadToOrRval | TailToOrRvals],
> > + expect(unify(AllPartialsRight, not_all_partials_assign_right),
> > + $pred, "ToOrRvals != [] but all_partials_assign_right"),
> > + Defns = [],
> > + ToOrRval = ml_bitwise_or_some_rvals(HeadToOrRval, TailToOrRvals),
> > + ComplementMask = ml_const(mlconst_uint(\ ToOrMask)),
> > + MaskedOldTagwordRval = ml_binop(bitwise_and(int_type_uint),
> > + CastTagwordRval, ComplementMask),
> > + NewTagwordRval = ml_binop(bitwise_or(int_type_uint),
> > + MaskedOldTagwordRval, ToOrRval),
> > + ( if TagwordType = mlds_native_uint_type then
> > + CastNewTagwordRval = NewTagwordRval
> > + else
> > + CastNewTagwordRval = ml_cast(TagwordType, NewTagwordRval)
> > ),
> > - Defns = []
> > + LeftStmt = ml_gen_assign(LHSTagwordLval, CastNewTagwordRval, Context),
> > + Stmts = [LeftStmt | RightStmts]
> > )
Kept just to provide context for the question above.
> The rest looked fine.
Thanks for the review.
Zoltan.