[m-rev.] for review: string switches using tries

Peter Wang novalazy at gmail.com
Tue Feb 24 12:32:48 AEDT 2015


On Mon, 23 Feb 2015 16:01:54 +1100 (EST), "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> For review by anyone.
> 
> We discussed this recently, and I found a way to implement it
> that should work for non-C backends. At each tip of the trie,
> where there is only possible string left that the switched-on variable
> could match, we invoke the new operation offset_str_eq(N, A, B).
> By using the operation, the compiler is asserting that the first N
> code units of A and B are equal, and the operation tests whether
> the remaining code units are equal as well. The non-C backends,
> for which this restricted test is hard or impossible to test efficiently,
> can implement this op simply by comparing the WHOLE of A and B.

> This is slightly wasteful, in that the first N code units have already
> been tested and found to be equal by the trie nodes above, but
> it *will* work, and it will almost certainly be faster than the alternatives.
> 
> While working on this diff; I found that the way mlds_to_{cs,java,managed}.m
> handle writing out binops is screwed up. Once this diff is committed,
> someone should fix these, along the lines of the way that this diff
> fixes mlds_to_c.m: by making the code a direct switch on the binop,
> as opposed to the current if-then-else chain, which is VERY vulnerable
> to handling new binops incorrectly.
> 

The implementation will not work in general if the host compiler and the
target differ in the string encoding, e.g. the compiler uses UTF-8 but
the target uses UTF-16.

The fix should only require that we replace string.{to,from}_code_unit_list
with functions that deal in the code units of the TARGET string encoding,
and build tries from that.  The standard library does not yet have
string.{to,from}_{utf8,utf16}_code_unit_list so, for now, the safe option
is to disable the trie implementation when the string encodings differ.


> ZZZ
> Implement string switches via tries for the MLDS backend.

Delete ZZZ

> diff --git a/compiler/c_util.m b/compiler/c_util.m
> index a955733..f4046f1 100644
> --- a/compiler/c_util.m
> +++ b/compiler/c_util.m
> @@ -146,14 +146,17 @@
>  
>  :- type binop_category
>      --->    array_index_binop
> +    ;       string_index_binop
>      ;       pointer_compare_binop
>      ;       compound_compare_binop
> +    ;       offset_string_compare_binop(int)
> +    ;       general_string_compare_binop
>      ;       string_compare_binop

Are both general_string_compare_binop and string_compare_binop used?

> diff --git a/compiler/ml_string_switch.m b/compiler/ml_string_switch.m
> index a15fee0..1a1b634 100644
> --- a/compiler/ml_string_switch.m
> +++ b/compiler/ml_string_switch.m
...
> +%
> +% All of these techniques are implemented both for switches in which
> +% each arm executes code (we call these the "jump" versions, since the task
> +% of the indexing method is to decide which piece of code to jump to),
> +% and for switches in which each arm looks up the data to return in tables
> +% (we call these the "lookup" versions). The lookup versions themselves
> +% come in two distinct flavours: those in which each harm has at most
> +% one solution, and those in which some arms gave more than one soluion.

s/harm/arm

> +%-----------------------------------------------------------------------------%
> +
> +:- pred create_trie(list(tagged_case)::in, int::out, trie_node::out) is det.
> +
> +create_trie(TaggedCases, MaxCaseNum, TopTrieNode) :-
> +    build_str_case_id_assoc_list(TaggedCases, -1, MaxCaseNum, [], StrsCaseIds),
> +    % The order of StrsCaseIds does not matter; we will build the same trie
> +    % regardless of the order.
> +    (
> +        StrsCaseIds = [],
> +        TopTrieNode = trie_choice([], map.init, no)
> +    ;
> +        StrsCaseIds = [HeadStrCaseId | TailStrCaseIds],
> +        HeadStrCaseId = HeadStr - HeadCaseId,
> +        string.to_code_unit_list(HeadStr, HeadStrCodeUnits),
> +        TopTrieNode1 = trie_leaf([], HeadStrCodeUnits, HeadCaseId),
> +        insert_cases_into_trie(TailStrCaseIds, TopTrieNode1, TopTrieNode)
> +    ).
> +
> +:- pred insert_cases_into_trie(assoc_list(string, case_id)::in,
> +    trie_node::in, trie_node::out) is det.
> +
> +insert_cases_into_trie([], !TrieNode).
> +insert_cases_into_trie([Case | Cases], !TrieNode) :-
> +    Case = Str - CaseId,
> +    string.to_code_unit_list(Str, StrCodeUnits),
> +    insert_case_into_trie_node([], StrCodeUnits, CaseId, !TrieNode),
> +    insert_cases_into_trie(Cases, !TrieNode).
> +

> +:- pred insert_case_into_trie_node(list(int)::in, list(int)::in, case_id::in,
> +    trie_node::in, trie_node::out) is det.
> +
> +insert_case_into_trie_node(InsertMatched, InsertNotYetMatched, InsertCaseId,
> +        TrieNode0, TrieNode) :-
> +    (
> +        TrieNode0 = trie_leaf(LeafMatched, LeafNotYetMatched, LeafCaseId),
> +        expect(unify(LeafMatched, InsertMatched), $module, $pred,
> +            "LeafMatched didn't"),
> +        (
> +            LeafNotYetMatched = [],
> +            ChoiceMap0 = map.init,
> +            MaybeEnd0 = yes(LeafCaseId)
> +        ;
> +            LeafNotYetMatched = [LeafFirstCodeUnit | LeafLaterCodeUnits],
> +            NewLeaf = trie_leaf([LeafFirstCodeUnit | LeafMatched],
> +                LeafLaterCodeUnits, LeafCaseId),
> +            ChoiceMap0 = map.singleton(LeafFirstCodeUnit, NewLeaf),
> +            MaybeEnd0 = no
> +        )
> +    ;
> +        TrieNode0 = trie_choice(ChoiceMatched, ChoiceMap0, MaybeEnd0),
> +        expect(unify(ChoiceMatched, InsertMatched), $module, $pred,
> +            "ChoiceMatched didn't")
> +    ),
> +    insert_case_into_trie_choice(InsertMatched, InsertNotYetMatched,
> +        InsertCaseId, ChoiceMap0, ChoiceMap, MaybeEnd0, MaybeEnd),
> +    TrieNode = trie_choice(InsertMatched, ChoiceMap, MaybeEnd).

> +:- pred insert_case_into_trie_choice(list(int)::in, list(int)::in, case_id::in,
> +    map(int, trie_node)::in, map(int, trie_node)::out,
> +    maybe(case_id)::in, maybe(case_id)::out) is det.
> +
> +insert_case_into_trie_choice(InsertMatched, InsertNotYetMatched, InsertCaseId,
> +        ChoiceMap0, ChoiceMap, MaybeEnd0, MaybeEnd) :-
> +    (
> +        InsertNotYetMatched = [],
> +        ChoiceMap = ChoiceMap0,
> +        (
> +            MaybeEnd0 = no,
> +            MaybeEnd = yes(InsertCaseId)
> +        ;
> +            MaybeEnd0 = yes(_),
> +            % You can't have more than one occurrence of a string
> +            % as a cons_id in a switch.
> +            unexpected($module, $pred, "two strings end at same trie node")
> +        )
> +    ;
> +        InsertNotYetMatched = [InsertFirstCodeUnit | InsertLaterCodeUnits],
> +        MaybeEnd = MaybeEnd0,
> +        ( if map.search(ChoiceMap0, InsertFirstCodeUnit, SubTrieNode0) then
> +            insert_case_into_trie_node([InsertFirstCodeUnit | InsertMatched],
> +                InsertLaterCodeUnits, InsertCaseId, SubTrieNode0, SubTrieNode),
> +            map.det_update(InsertFirstCodeUnit, SubTrieNode,
> +                ChoiceMap0, ChoiceMap)
> +        else
> +            SubTrieNode = trie_leaf([InsertFirstCodeUnit | InsertMatched],
> +                InsertLaterCodeUnits, InsertCaseId),
> +            map.det_insert(InsertFirstCodeUnit, SubTrieNode,
> +                ChoiceMap0, ChoiceMap)
> +        )
> +    ).

> +
> +    % Generate the following local variable declarations:
> +    %   int         case_num = -1;
> +    %

declaration

> +:- pred ml_gen_trie_case_num_var_and_init(mlds_context::in,
> +    mlds_lval::out, mlds_defn::out, statement::out,
> +    ml_gen_info::in, ml_gen_info::out) is det.
> +
> +ml_gen_trie_case_num_var_and_init(MLDS_Context, CaseNumVarLval, CaseNumVarDefn,
> +        InitStatement, !Info) :-
> +    ml_gen_info_new_aux_var_name("case_num", CaseNumVar, !Info),
> +    CaseNumVarType = mlds_native_int_type,
> +    % We never need to trace ints.
> +    CaseNumVarDefn = ml_gen_mlds_var_decl(mlds_data_var(CaseNumVar),
> +        CaseNumVarType, gc_no_stmt, MLDS_Context),
> +    ml_gen_var_lval(!.Info, CaseNumVar, CaseNumVarType, CaseNumVarLval),
> +
> +    InitAssign = assign(CaseNumVarLval, ml_const(mlconst_int(-1))),
> +    InitStmt = ml_stmt_atomic(InitAssign),
> +    InitStatement = statement(InitStmt, MLDS_Context).
> +

> +%-----------------------------------------------------------------------------%
> +
> +:- pred convert_trie_to_nested_switches(mlds_rval::in, mlds_lval::in,
> +    mlds_context::in, int::in, trie_node::in, statement::out) is det.
> +
> +convert_trie_to_nested_switches(VarRval, CaseNumVarLval, Context, NumMatched,
> +        TrieNode, Statement) :-
> +    (
> +        TrieNode = trie_leaf(RevMatchedCodeUnits, NotYetMatchedCodeUnits,
> +            CaseId),
> +        CaseId = case_id(CaseNum),
> +
> +        CaseNumRval = ml_const(mlconst_int(CaseNum)),
> +        SetCaseNumVarAssign = assign(CaseNumVarLval, CaseNumRval),
> +        SetCaseNumVarStmt = ml_stmt_atomic(SetCaseNumVarAssign),
> +        SetCaseNumVarStatement = statement(SetCaseNumVarStmt, Context),
> +
> +        AllCodeUnits =
> +            list.reverse(RevMatchedCodeUnits) ++ NotYetMatchedCodeUnits,
> +        list.length(RevMatchedCodeUnits, NumRevMatchedCodeUnits),
> +        expect(unify(NumRevMatchedCodeUnits, NumMatched), $module, $pred,
> +            "NumRevMatchedCodeUnits != NumMatched"),
> +        ( if string.from_code_unit_list(AllCodeUnits, String) then
> +            StringRval = ml_const(mlconst_string(String))
> +        else
> +            unexpected($module, $pred,
> +                "code units cannot be turned back into string")
> +        ),
> +        CondRval = ml_binop(offset_str_eq(NumMatched), VarRval, StringRval),
> +        Stmt = ml_stmt_if_then_else(CondRval, SetCaseNumVarStatement, no)
> +    ;
> +        TrieNode = trie_choice(RevMatchedCodeUnits, ChoiceMap, MaybeEnd),
> +        CurCodeUnitRval = ml_binop(string_unsafe_index_code_unit,
> +            VarRval, ml_const(mlconst_int(NumMatched))),
> +        map.to_assoc_list(ChoiceMap, ChoicePairs),
> +        ( if
> +            ChoicePairs = [OneChoicePair],
> +            MaybeEnd = no
> +        then
> +            OneChoicePair = OneCodeUnit - OneSubTrieNode,
> +            OneCodeUnitConst = ml_const(mlconst_int(OneCodeUnit)),
> +            FirstCond = ml_binop(eq, CurCodeUnitRval, OneCodeUnitConst),
> +            chase_one_cond_trie_nodes(VarRval, CaseNumVarLval, Context,
> +                NumMatched + 1, OneSubTrieNode, FirstCond, AllCond,
> +                ThenStatement),
> +            Stmt = ml_stmt_if_then_else(AllCond, ThenStatement, no)
> +        else
> +            convert_trie_choices_to_nested_switches(VarRval, CaseNumVarLval,
> +                Context, NumMatched + 1, ChoicePairs,
> +                cord.init, SwitchArmsCord0),
> +            SwitchArms0 = cord.list(SwitchArmsCord0),
> +            (
> +                MaybeEnd = no,
> +                SwitchArms = SwitchArms0
> +            ;
> +                MaybeEnd = yes(EndCaseId),
> +                EndCaseId = case_id(EndCaseNum),
> +
> +                EndCaseNumRval = ml_const(mlconst_int(EndCaseNum)),
> +                EndSetCaseNumVarAssign = assign(CaseNumVarLval, EndCaseNumRval),
> +                EndSetCaseNumVarStmt = ml_stmt_atomic(EndSetCaseNumVarAssign),
> +                EndSetCaseNumVarStatement =
> +                    statement(EndSetCaseNumVarStmt, Context),
> +                NullCodeUnit = 0,    % Match the terminating NULL character.

s/NULL/NUL  or  s/NULL/null

This won't work for backends which do not use a NUL terminator.

> +                NullMatchCond =
> +                    match_value(ml_const(mlconst_int(NullCodeUnit))),
> +                EndSwitchArm = mlds_switch_case(NullMatchCond, [],
> +                    EndSetCaseNumVarStatement),
> +                SwitchArms = [EndSwitchArm | SwitchArms0]
> +            ),

> +            list.length(RevMatchedCodeUnits, NumRevMatchedCodeUnits),
> +            expect(unify(NumRevMatchedCodeUnits, NumMatched), $module, $pred,
> +                "NumRevMatchedCodeUnits != NumMatched"),
> +            SwitchCodeUnitRval = CurCodeUnitRval,
> +            % Could we set this to a known range? If we could,
> +            % would it be useful?
> +            SwitchRange = mlds_switch_range_unknown,
> +            Stmt = ml_stmt_switch(mlds_native_int_type, SwitchCodeUnitRval,
> +                SwitchRange, SwitchArms, default_do_nothing)
> +        )
> +    ),
> +    Statement = statement(Stmt, Context).
> +

> +:- pred convert_trie_choices_to_nested_switches(mlds_rval::in, mlds_lval::in,
> +    mlds_context::in, int::in, assoc_list(int, trie_node)::in,
> +    cord(mlds_switch_case)::in, cord(mlds_switch_case)::out) is det.
> +
> +convert_trie_choices_to_nested_switches(_, _, _, _, [], !SwitchArmsCord).
> +convert_trie_choices_to_nested_switches(VarRval, CaseNumVarLval, Context,
> +        NumMatched, [ChoicePair | ChoicePairs], !SwitchArmsCord) :-
> +    ChoicePair = CodeUnit - SubTrieNode,
> +    convert_trie_to_nested_switches(VarRval, CaseNumVarLval, Context,
> +        NumMatched, SubTrieNode, SwitchArmStatement),
> +    MatchCond = match_value(ml_const(mlconst_int(CodeUnit))),
> +    SwitchArm = mlds_switch_case(MatchCond, [], SwitchArmStatement),
> +    !:SwitchArmsCord = cord.snoc(!.SwitchArmsCord, SwitchArm),
> +    convert_trie_choices_to_nested_switches(VarRval, CaseNumVarLval, Context,
> +        NumMatched, ChoicePairs, !SwitchArmsCord).

> +:- pred generate_trie_arms(assoc_list(case_id, statement)::in,
> +    list(mlds_switch_case)::in, list(mlds_switch_case)::out) is det.
> +
> +generate_trie_arms([], !RevSwitchCases).
> +generate_trie_arms([CasePair | CasePairs], !RevSwitchCases) :-
> +    CasePair = CaseId - CaseStatement,
> +    CaseId = case_id(CaseNum),
> +    MatchCond = match_value(ml_const(mlconst_int(CaseNum))),
> +    Case = mlds_switch_case(MatchCond, [], CaseStatement),
> +    !:RevSwitchCases = [Case | !.RevSwitchCases],
> +    generate_trie_arms(CasePairs, !RevSwitchCases).
> +
> +:- pred chase_one_cond_trie_nodes(mlds_rval::in, mlds_lval::in,
> +    mlds_context::in, int::in, trie_node::in, mlds_rval::in, mlds_rval::out,
> +    statement::out) is det.
> +
> +chase_one_cond_trie_nodes(VarRval, CaseNumVarLval, Context, NumMatched,
> +        TrieNode, RevCond0, RevCond, ThenStatement) :-
> +    ( if
> +        TrieNode = trie_choice(RevMatchedCodeUnits, ChoiceMap, MaybeEnd),
> +        map.to_assoc_list(ChoiceMap, ChoicePairs),
> +        ChoicePairs = [OneChoicePair],
> +        MaybeEnd = no
> +    then
> +        expect(unify(list.length(RevMatchedCodeUnits), NumMatched),
> +            $module, $pred, "length(RevMatchedCodeUnits) != NumMatched"),
> +        OneChoicePair = OneCodeUnit - OneSubTrieNode,
> +        CurCodeUnitRval = ml_binop(string_unsafe_index_code_unit,
> +            VarRval, ml_const(mlconst_int(NumMatched))),
> +        OneCodeUnitConst = ml_const(mlconst_int(OneCodeUnit)),
> +        CurCond = ml_binop(eq, CurCodeUnitRval, OneCodeUnitConst),
> +        RevCond1 = ml_binop(logical_and, RevCond0, CurCond),
> +        chase_one_cond_trie_nodes(VarRval, CaseNumVarLval, Context,
> +            NumMatched + 1, OneSubTrieNode, RevCond1, RevCond, ThenStatement)
> +    else
> +        RevCond = RevCond0,
> +        convert_trie_to_nested_switches(VarRval, CaseNumVarLval, Context,
> +            NumMatched, TrieNode, ThenStatement)
> +    ).
> +

> @@ -1486,7 +2085,7 @@ ml_gen_string_binary_switch_search(MLDS_Context, InitialComment,
>  %-----------------------------------------------------------------------------%
>  %-----------------------------------------------------------------------------%
>  %
> -% Code useful for all kinds of string switches.
> +% Code useful for more than one kind of string switches.
>  %

switch

> diff --git a/compiler/ml_switch_gen.m b/compiler/ml_switch_gen.m
> index 2232eb6..4b7c42e 100644
> --- a/compiler/ml_switch_gen.m
> +++ b/compiler/ml_switch_gen.m
> @@ -264,70 +270,111 @@ ml_gen_smart_string_switch(SwitchVar, CanFail, TaggedCases, CodeModel, Context,
...
> +    ;
> +        % We can use a trie, a hash switch or binary switch, and we prefer
> +        % them in that order, with one exception, which is that if the switch
> +        % is a lookup switch, we won't use tries, since they don't support
> +        % lookups.

Out of date comment?

> +        %
> +        % We can implement all three methods (tries, hash tables and binary
> +        % searches using either computed gotos or int switches in the target
> +        % language.

missing )

> +        ml_is_lookup_switch(SwitchVar, FilteredTaggedCases, GoalInfo,
> +            CodeModel, MaybeLookupSwitchInfo, !Info),
> +        ml_gen_var(!.Info, SwitchVar, SwitchVarLval),
> +        SwitchVarRval = ml_lval(SwitchVarLval),
> +        (
> +            globals.lookup_int_option(Globals, string_trie_switch_size,
> +                StringTrieSwitchSize),
> +            NumConsIds >= StringTrieSwitchSize,
> +            globals.get_target(Globals, target_c)
>          ->

> diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h
> index 5cc74db..94b7e20 100644
> --- a/runtime/mercury_string.h
> +++ b/runtime/mercury_string.h
> @@ -384,6 +384,21 @@ MR_Integer      MR_hash_string6(MR_ConstString);
>  #define MR_strcmp(s, t)         strcmp((const char *)(s), (const char *)(t))
>  
>  /*
> +** Assuming that the first n characters of s and t are equal,
> +** are the rest of s and t equal?
> +*/

code units

> +
> +#define MR_offset_streq(n, s, t) \
> +    (strcmp((const char *)((s)+(n)), (const char *)((t)+(n))) == 0)
> +
> +/*
> +** Return the kth code unit in a string.
> +*/
> +
> +#define MR_nth_code_unit(s, k) \
> +    ((unsigned) (((const unsigned char *) (s))[(k)]))
> +
> +/*
>  ** Return an MR_String which has been created using the format string, fmt,
>  ** passed to sprintf. If memory profiling is turned on, record the allocation
>  ** as coming from proclabel. The MR_String returned has been allocated


Peter



More information about the reviews mailing list