[m-rev.] for review: string switches using tries
Peter Wang
novalazy at gmail.com
Tue Feb 24 12:32:48 AEDT 2015
On Mon, 23 Feb 2015 16:01:54 +1100 (EST), "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> For review by anyone.
>
> We discussed this recently, and I found a way to implement it
> that should work for non-C backends. At each tip of the trie,
> where there is only possible string left that the switched-on variable
> could match, we invoke the new operation offset_str_eq(N, A, B).
> By using the operation, the compiler is asserting that the first N
> code units of A and B are equal, and the operation tests whether
> the remaining code units are equal as well. The non-C backends,
> for which this restricted test is hard or impossible to test efficiently,
> can implement this op simply by comparing the WHOLE of A and B.
> This is slightly wasteful, in that the first N code units have already
> been tested and found to be equal by the trie nodes above, but
> it *will* work, and it will almost certainly be faster than the alternatives.
>
> While working on this diff; I found that the way mlds_to_{cs,java,managed}.m
> handle writing out binops is screwed up. Once this diff is committed,
> someone should fix these, along the lines of the way that this diff
> fixes mlds_to_c.m: by making the code a direct switch on the binop,
> as opposed to the current if-then-else chain, which is VERY vulnerable
> to handling new binops incorrectly.
>
The implementation will not work in general if the host compiler and the
target differ in the string encoding, e.g. the compiler uses UTF-8 but
the target uses UTF-16.
The fix should only require that we replace string.{to,from}_code_unit_list
with functions that deal in the code units of the TARGET string encoding,
and build tries from that. The standard library does not yet have
string.{to,from}_{utf8,utf16}_code_unit_list so, for now, the safe option
is to disable the trie implementation when the string encodings differ.
> ZZZ
> Implement string switches via tries for the MLDS backend.
Delete ZZZ
> diff --git a/compiler/c_util.m b/compiler/c_util.m
> index a955733..f4046f1 100644
> --- a/compiler/c_util.m
> +++ b/compiler/c_util.m
> @@ -146,14 +146,17 @@
>
> :- type binop_category
> ---> array_index_binop
> + ; string_index_binop
> ; pointer_compare_binop
> ; compound_compare_binop
> + ; offset_string_compare_binop(int)
> + ; general_string_compare_binop
> ; string_compare_binop
Are both general_string_compare_binop and string_compare_binop used?
> diff --git a/compiler/ml_string_switch.m b/compiler/ml_string_switch.m
> index a15fee0..1a1b634 100644
> --- a/compiler/ml_string_switch.m
> +++ b/compiler/ml_string_switch.m
...
> +%
> +% All of these techniques are implemented both for switches in which
> +% each arm executes code (we call these the "jump" versions, since the task
> +% of the indexing method is to decide which piece of code to jump to),
> +% and for switches in which each arm looks up the data to return in tables
> +% (we call these the "lookup" versions). The lookup versions themselves
> +% come in two distinct flavours: those in which each harm has at most
> +% one solution, and those in which some arms gave more than one soluion.
s/harm/arm
> +%-----------------------------------------------------------------------------%
> +
> +:- pred create_trie(list(tagged_case)::in, int::out, trie_node::out) is det.
> +
> +create_trie(TaggedCases, MaxCaseNum, TopTrieNode) :-
> + build_str_case_id_assoc_list(TaggedCases, -1, MaxCaseNum, [], StrsCaseIds),
> + % The order of StrsCaseIds does not matter; we will build the same trie
> + % regardless of the order.
> + (
> + StrsCaseIds = [],
> + TopTrieNode = trie_choice([], map.init, no)
> + ;
> + StrsCaseIds = [HeadStrCaseId | TailStrCaseIds],
> + HeadStrCaseId = HeadStr - HeadCaseId,
> + string.to_code_unit_list(HeadStr, HeadStrCodeUnits),
> + TopTrieNode1 = trie_leaf([], HeadStrCodeUnits, HeadCaseId),
> + insert_cases_into_trie(TailStrCaseIds, TopTrieNode1, TopTrieNode)
> + ).
> +
> +:- pred insert_cases_into_trie(assoc_list(string, case_id)::in,
> + trie_node::in, trie_node::out) is det.
> +
> +insert_cases_into_trie([], !TrieNode).
> +insert_cases_into_trie([Case | Cases], !TrieNode) :-
> + Case = Str - CaseId,
> + string.to_code_unit_list(Str, StrCodeUnits),
> + insert_case_into_trie_node([], StrCodeUnits, CaseId, !TrieNode),
> + insert_cases_into_trie(Cases, !TrieNode).
> +
> +:- pred insert_case_into_trie_node(list(int)::in, list(int)::in, case_id::in,
> + trie_node::in, trie_node::out) is det.
> +
> +insert_case_into_trie_node(InsertMatched, InsertNotYetMatched, InsertCaseId,
> + TrieNode0, TrieNode) :-
> + (
> + TrieNode0 = trie_leaf(LeafMatched, LeafNotYetMatched, LeafCaseId),
> + expect(unify(LeafMatched, InsertMatched), $module, $pred,
> + "LeafMatched didn't"),
> + (
> + LeafNotYetMatched = [],
> + ChoiceMap0 = map.init,
> + MaybeEnd0 = yes(LeafCaseId)
> + ;
> + LeafNotYetMatched = [LeafFirstCodeUnit | LeafLaterCodeUnits],
> + NewLeaf = trie_leaf([LeafFirstCodeUnit | LeafMatched],
> + LeafLaterCodeUnits, LeafCaseId),
> + ChoiceMap0 = map.singleton(LeafFirstCodeUnit, NewLeaf),
> + MaybeEnd0 = no
> + )
> + ;
> + TrieNode0 = trie_choice(ChoiceMatched, ChoiceMap0, MaybeEnd0),
> + expect(unify(ChoiceMatched, InsertMatched), $module, $pred,
> + "ChoiceMatched didn't")
> + ),
> + insert_case_into_trie_choice(InsertMatched, InsertNotYetMatched,
> + InsertCaseId, ChoiceMap0, ChoiceMap, MaybeEnd0, MaybeEnd),
> + TrieNode = trie_choice(InsertMatched, ChoiceMap, MaybeEnd).
> +:- pred insert_case_into_trie_choice(list(int)::in, list(int)::in, case_id::in,
> + map(int, trie_node)::in, map(int, trie_node)::out,
> + maybe(case_id)::in, maybe(case_id)::out) is det.
> +
> +insert_case_into_trie_choice(InsertMatched, InsertNotYetMatched, InsertCaseId,
> + ChoiceMap0, ChoiceMap, MaybeEnd0, MaybeEnd) :-
> + (
> + InsertNotYetMatched = [],
> + ChoiceMap = ChoiceMap0,
> + (
> + MaybeEnd0 = no,
> + MaybeEnd = yes(InsertCaseId)
> + ;
> + MaybeEnd0 = yes(_),
> + % You can't have more than one occurrence of a string
> + % as a cons_id in a switch.
> + unexpected($module, $pred, "two strings end at same trie node")
> + )
> + ;
> + InsertNotYetMatched = [InsertFirstCodeUnit | InsertLaterCodeUnits],
> + MaybeEnd = MaybeEnd0,
> + ( if map.search(ChoiceMap0, InsertFirstCodeUnit, SubTrieNode0) then
> + insert_case_into_trie_node([InsertFirstCodeUnit | InsertMatched],
> + InsertLaterCodeUnits, InsertCaseId, SubTrieNode0, SubTrieNode),
> + map.det_update(InsertFirstCodeUnit, SubTrieNode,
> + ChoiceMap0, ChoiceMap)
> + else
> + SubTrieNode = trie_leaf([InsertFirstCodeUnit | InsertMatched],
> + InsertLaterCodeUnits, InsertCaseId),
> + map.det_insert(InsertFirstCodeUnit, SubTrieNode,
> + ChoiceMap0, ChoiceMap)
> + )
> + ).
> +
> + % Generate the following local variable declarations:
> + % int case_num = -1;
> + %
declaration
> +:- pred ml_gen_trie_case_num_var_and_init(mlds_context::in,
> + mlds_lval::out, mlds_defn::out, statement::out,
> + ml_gen_info::in, ml_gen_info::out) is det.
> +
> +ml_gen_trie_case_num_var_and_init(MLDS_Context, CaseNumVarLval, CaseNumVarDefn,
> + InitStatement, !Info) :-
> + ml_gen_info_new_aux_var_name("case_num", CaseNumVar, !Info),
> + CaseNumVarType = mlds_native_int_type,
> + % We never need to trace ints.
> + CaseNumVarDefn = ml_gen_mlds_var_decl(mlds_data_var(CaseNumVar),
> + CaseNumVarType, gc_no_stmt, MLDS_Context),
> + ml_gen_var_lval(!.Info, CaseNumVar, CaseNumVarType, CaseNumVarLval),
> +
> + InitAssign = assign(CaseNumVarLval, ml_const(mlconst_int(-1))),
> + InitStmt = ml_stmt_atomic(InitAssign),
> + InitStatement = statement(InitStmt, MLDS_Context).
> +
> +%-----------------------------------------------------------------------------%
> +
> +:- pred convert_trie_to_nested_switches(mlds_rval::in, mlds_lval::in,
> + mlds_context::in, int::in, trie_node::in, statement::out) is det.
> +
> +convert_trie_to_nested_switches(VarRval, CaseNumVarLval, Context, NumMatched,
> + TrieNode, Statement) :-
> + (
> + TrieNode = trie_leaf(RevMatchedCodeUnits, NotYetMatchedCodeUnits,
> + CaseId),
> + CaseId = case_id(CaseNum),
> +
> + CaseNumRval = ml_const(mlconst_int(CaseNum)),
> + SetCaseNumVarAssign = assign(CaseNumVarLval, CaseNumRval),
> + SetCaseNumVarStmt = ml_stmt_atomic(SetCaseNumVarAssign),
> + SetCaseNumVarStatement = statement(SetCaseNumVarStmt, Context),
> +
> + AllCodeUnits =
> + list.reverse(RevMatchedCodeUnits) ++ NotYetMatchedCodeUnits,
> + list.length(RevMatchedCodeUnits, NumRevMatchedCodeUnits),
> + expect(unify(NumRevMatchedCodeUnits, NumMatched), $module, $pred,
> + "NumRevMatchedCodeUnits != NumMatched"),
> + ( if string.from_code_unit_list(AllCodeUnits, String) then
> + StringRval = ml_const(mlconst_string(String))
> + else
> + unexpected($module, $pred,
> + "code units cannot be turned back into string")
> + ),
> + CondRval = ml_binop(offset_str_eq(NumMatched), VarRval, StringRval),
> + Stmt = ml_stmt_if_then_else(CondRval, SetCaseNumVarStatement, no)
> + ;
> + TrieNode = trie_choice(RevMatchedCodeUnits, ChoiceMap, MaybeEnd),
> + CurCodeUnitRval = ml_binop(string_unsafe_index_code_unit,
> + VarRval, ml_const(mlconst_int(NumMatched))),
> + map.to_assoc_list(ChoiceMap, ChoicePairs),
> + ( if
> + ChoicePairs = [OneChoicePair],
> + MaybeEnd = no
> + then
> + OneChoicePair = OneCodeUnit - OneSubTrieNode,
> + OneCodeUnitConst = ml_const(mlconst_int(OneCodeUnit)),
> + FirstCond = ml_binop(eq, CurCodeUnitRval, OneCodeUnitConst),
> + chase_one_cond_trie_nodes(VarRval, CaseNumVarLval, Context,
> + NumMatched + 1, OneSubTrieNode, FirstCond, AllCond,
> + ThenStatement),
> + Stmt = ml_stmt_if_then_else(AllCond, ThenStatement, no)
> + else
> + convert_trie_choices_to_nested_switches(VarRval, CaseNumVarLval,
> + Context, NumMatched + 1, ChoicePairs,
> + cord.init, SwitchArmsCord0),
> + SwitchArms0 = cord.list(SwitchArmsCord0),
> + (
> + MaybeEnd = no,
> + SwitchArms = SwitchArms0
> + ;
> + MaybeEnd = yes(EndCaseId),
> + EndCaseId = case_id(EndCaseNum),
> +
> + EndCaseNumRval = ml_const(mlconst_int(EndCaseNum)),
> + EndSetCaseNumVarAssign = assign(CaseNumVarLval, EndCaseNumRval),
> + EndSetCaseNumVarStmt = ml_stmt_atomic(EndSetCaseNumVarAssign),
> + EndSetCaseNumVarStatement =
> + statement(EndSetCaseNumVarStmt, Context),
> + NullCodeUnit = 0, % Match the terminating NULL character.
s/NULL/NUL or s/NULL/null
This won't work for backends which do not use a NUL terminator.
> + NullMatchCond =
> + match_value(ml_const(mlconst_int(NullCodeUnit))),
> + EndSwitchArm = mlds_switch_case(NullMatchCond, [],
> + EndSetCaseNumVarStatement),
> + SwitchArms = [EndSwitchArm | SwitchArms0]
> + ),
> + list.length(RevMatchedCodeUnits, NumRevMatchedCodeUnits),
> + expect(unify(NumRevMatchedCodeUnits, NumMatched), $module, $pred,
> + "NumRevMatchedCodeUnits != NumMatched"),
> + SwitchCodeUnitRval = CurCodeUnitRval,
> + % Could we set this to a known range? If we could,
> + % would it be useful?
> + SwitchRange = mlds_switch_range_unknown,
> + Stmt = ml_stmt_switch(mlds_native_int_type, SwitchCodeUnitRval,
> + SwitchRange, SwitchArms, default_do_nothing)
> + )
> + ),
> + Statement = statement(Stmt, Context).
> +
> +:- pred convert_trie_choices_to_nested_switches(mlds_rval::in, mlds_lval::in,
> + mlds_context::in, int::in, assoc_list(int, trie_node)::in,
> + cord(mlds_switch_case)::in, cord(mlds_switch_case)::out) is det.
> +
> +convert_trie_choices_to_nested_switches(_, _, _, _, [], !SwitchArmsCord).
> +convert_trie_choices_to_nested_switches(VarRval, CaseNumVarLval, Context,
> + NumMatched, [ChoicePair | ChoicePairs], !SwitchArmsCord) :-
> + ChoicePair = CodeUnit - SubTrieNode,
> + convert_trie_to_nested_switches(VarRval, CaseNumVarLval, Context,
> + NumMatched, SubTrieNode, SwitchArmStatement),
> + MatchCond = match_value(ml_const(mlconst_int(CodeUnit))),
> + SwitchArm = mlds_switch_case(MatchCond, [], SwitchArmStatement),
> + !:SwitchArmsCord = cord.snoc(!.SwitchArmsCord, SwitchArm),
> + convert_trie_choices_to_nested_switches(VarRval, CaseNumVarLval, Context,
> + NumMatched, ChoicePairs, !SwitchArmsCord).
> +:- pred generate_trie_arms(assoc_list(case_id, statement)::in,
> + list(mlds_switch_case)::in, list(mlds_switch_case)::out) is det.
> +
> +generate_trie_arms([], !RevSwitchCases).
> +generate_trie_arms([CasePair | CasePairs], !RevSwitchCases) :-
> + CasePair = CaseId - CaseStatement,
> + CaseId = case_id(CaseNum),
> + MatchCond = match_value(ml_const(mlconst_int(CaseNum))),
> + Case = mlds_switch_case(MatchCond, [], CaseStatement),
> + !:RevSwitchCases = [Case | !.RevSwitchCases],
> + generate_trie_arms(CasePairs, !RevSwitchCases).
> +
> +:- pred chase_one_cond_trie_nodes(mlds_rval::in, mlds_lval::in,
> + mlds_context::in, int::in, trie_node::in, mlds_rval::in, mlds_rval::out,
> + statement::out) is det.
> +
> +chase_one_cond_trie_nodes(VarRval, CaseNumVarLval, Context, NumMatched,
> + TrieNode, RevCond0, RevCond, ThenStatement) :-
> + ( if
> + TrieNode = trie_choice(RevMatchedCodeUnits, ChoiceMap, MaybeEnd),
> + map.to_assoc_list(ChoiceMap, ChoicePairs),
> + ChoicePairs = [OneChoicePair],
> + MaybeEnd = no
> + then
> + expect(unify(list.length(RevMatchedCodeUnits), NumMatched),
> + $module, $pred, "length(RevMatchedCodeUnits) != NumMatched"),
> + OneChoicePair = OneCodeUnit - OneSubTrieNode,
> + CurCodeUnitRval = ml_binop(string_unsafe_index_code_unit,
> + VarRval, ml_const(mlconst_int(NumMatched))),
> + OneCodeUnitConst = ml_const(mlconst_int(OneCodeUnit)),
> + CurCond = ml_binop(eq, CurCodeUnitRval, OneCodeUnitConst),
> + RevCond1 = ml_binop(logical_and, RevCond0, CurCond),
> + chase_one_cond_trie_nodes(VarRval, CaseNumVarLval, Context,
> + NumMatched + 1, OneSubTrieNode, RevCond1, RevCond, ThenStatement)
> + else
> + RevCond = RevCond0,
> + convert_trie_to_nested_switches(VarRval, CaseNumVarLval, Context,
> + NumMatched, TrieNode, ThenStatement)
> + ).
> +
> @@ -1486,7 +2085,7 @@ ml_gen_string_binary_switch_search(MLDS_Context, InitialComment,
> %-----------------------------------------------------------------------------%
> %-----------------------------------------------------------------------------%
> %
> -% Code useful for all kinds of string switches.
> +% Code useful for more than one kind of string switches.
> %
switch
> diff --git a/compiler/ml_switch_gen.m b/compiler/ml_switch_gen.m
> index 2232eb6..4b7c42e 100644
> --- a/compiler/ml_switch_gen.m
> +++ b/compiler/ml_switch_gen.m
> @@ -264,70 +270,111 @@ ml_gen_smart_string_switch(SwitchVar, CanFail, TaggedCases, CodeModel, Context,
...
> + ;
> + % We can use a trie, a hash switch or binary switch, and we prefer
> + % them in that order, with one exception, which is that if the switch
> + % is a lookup switch, we won't use tries, since they don't support
> + % lookups.
Out of date comment?
> + %
> + % We can implement all three methods (tries, hash tables and binary
> + % searches using either computed gotos or int switches in the target
> + % language.
missing )
> + ml_is_lookup_switch(SwitchVar, FilteredTaggedCases, GoalInfo,
> + CodeModel, MaybeLookupSwitchInfo, !Info),
> + ml_gen_var(!.Info, SwitchVar, SwitchVarLval),
> + SwitchVarRval = ml_lval(SwitchVarLval),
> + (
> + globals.lookup_int_option(Globals, string_trie_switch_size,
> + StringTrieSwitchSize),
> + NumConsIds >= StringTrieSwitchSize,
> + globals.get_target(Globals, target_c)
> ->
> diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h
> index 5cc74db..94b7e20 100644
> --- a/runtime/mercury_string.h
> +++ b/runtime/mercury_string.h
> @@ -384,6 +384,21 @@ MR_Integer MR_hash_string6(MR_ConstString);
> #define MR_strcmp(s, t) strcmp((const char *)(s), (const char *)(t))
>
> /*
> +** Assuming that the first n characters of s and t are equal,
> +** are the rest of s and t equal?
> +*/
code units
> +
> +#define MR_offset_streq(n, s, t) \
> + (strcmp((const char *)((s)+(n)), (const char *)((t)+(n))) == 0)
> +
> +/*
> +** Return the kth code unit in a string.
> +*/
> +
> +#define MR_nth_code_unit(s, k) \
> + ((unsigned) (((const unsigned char *) (s))[(k)]))
> +
> +/*
> ** Return an MR_String which has been created using the format string, fmt,
> ** passed to sprintf. If memory profiling is turned on, record the allocation
> ** as coming from proclabel. The MR_String returned has been allocated
Peter
More information about the reviews
mailing list