[m-dev.] For review: hash table implementation
Ralph Becket
rbeck at microsoft.com
Thu Feb 1 03:18:53 AEDT 2001
This has been missing for too long. It should be considerably faster
and more economical than maps in many cases (each update consumes only
6 or 7 words of storage, and only 1 if structure reuse works for it).
Of course, against that you have the fact that you're using extra space
up, but in practice occupancy should be between 45 - 90% which is on a
par with 234-trees.
Downsides: arguably more expensive for mapping and folding (I have yet
to supply these functions); weird modes are required; you have to supply
hashing predicates for each key type (anyone got a generic hash
function or two handy?)
Ralph
Estimated hours taken: 6
Added a hash table implementation to the library.
library/hash_table.m:
Created.
Full source:
%
----------------------------------------------------------------------------
%
% hash_table.m
% Ralph Becket <rbeck at microsoft.com>
% Tue Jan 30 15:21:45 GMT 2001
% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
%
% Hash table implementation.
%
% This implementation uses double hashing and requires the user to
% supply a predicate that will compute two independent hash values
% for any given key.
%
% The number of buckets in the hash table is always a power of 2.
%
% When a user set occupancy level is achieved, the number of buckets
% in the table is doubled and the previous contents reinserted into
% the new hash table.
%
%
----------------------------------------------------------------------------
%
:- module hash_table.
:- interface.
:- import_module int, assoc_list, float, string, array.
:- type hash_table(K, V).
% XXX This is all fake until the compiler can handle nested unique
modes.
%
:- inst hash_table ==
bound(ht(ground, ground, ground, hash_pred, array, array)).
:- mode hash_table_ui == in(hash_table).
:- mode hash_table_di == di(hash_table).
:- mode hash_table_uo == out(hash_table).
:- type hash_pred(K) == ( pred(K, int, int) ).
:- inst hash_pred == ( pred(in, out, out) is det ).
% new(HashPred, N, MaxOccupancy)
% constructs a new hash table with initial size 2 ^ N that is
% doubled whenever MaxOccupancy is achieved; elements are
% indexed using HashPred.
%
% N must be greater than 1.
% MaxOccupancy must be in (0.0, 1.0).
%
:- func new(hash_pred(K), int, float) = hash_table(K, V).
:- mode new(in(hash_pred), in, in) = hash_table_uo is det.
% new_default(HashFn) constructs a hash table with default size and
% occupancy arguments.
%
:- func new_default(hash_pred(K)) = hash_table(K, V).
:- mode new_default(in(hash_pred)) = hash_table_uo is det.
% Retrieve the hash_pred associated with a hash table.
%
:- func hash_pred(hash_table(K, V)) = hash_pred(K).
:- mode hash_pred(hash_table_ui) = out(hash_pred) is det.
% Default hash_preds for ints and strings.
%
:- func int_hash_pred = hash_pred(int).
:- mode int_hash_pred = out(hash_pred) is det.
:- func string_hash_pred = hash_pred(string).
:- mode string_hash_pred = out(hash_pred) is det.
% Returns the number of buckets in a hash table.
%
:- func num_buckets(hash_table(K, V)) = int.
:- mode num_buckets(hash_table_ui) = out is det.
% Returns the number of occupants in a hash table.
%
:- func num_occupants(hash_table(K, V)) = int.
:- mode num_occupants(hash_table_ui) = out is det.
% Insert an item into a hash table; if one is already there then
% it is overwritten.
%
:- func insert(hash_table(K, V), K, V) = hash_table(K, V).
:- mode insert(hash_table_di, in, in) = hash_table_uo is det.
:- func 'elem :='(K, hash_table(K, V), V) = hash_table(K, V).
:- mode 'elem :='(in, hash_table_di, in) = hash_table_uo is det.
% Delete the entry (if any) for the given key.
%
:- func delete(hash_table(K, V), K) = hash_table(K, V).
:- mode delete(hash_table_di, in) = hash_table_uo is det.
% Lookup the value associated with the given key. An exception
% is raised if there is no entry for the key.
%
:- func lookup(hash_table(K, V), K) = V.
:- mode lookup(hash_table_ui, in) = out is det.
:- func elem(K, hash_table(K, V)) = V.
:- mode elem(in, hash_table_ui) = out is det.
% Like lookup, but just fails if there is no entry for the key.
%
:- pred search(hash_table(K, V), K, V).
:- mode search(hash_table_ui, in, out) is semidet.
% Convert a hash table into an association list.
%
:- func to_assoc_list(hash_table(K, V)) = assoc_list(K, V).
:- mode to_assoc_list(hash_table_ui) = out is det.
%
----------------------------------------------------------------------------
%
%
----------------------------------------------------------------------------
%
:- implementation.
:- import_module exception, math, std_util, bool, exception, char, list.
:- type hash_table(K, V)
---> ht(
num_buckets :: int,
num_occupants :: int,
max_occupants :: int,
hash_pred :: hash_pred(K),
keys :: array(maybe(K)),
values :: array(V)
).
%
----------------------------------------------------------------------------
%
% THE HASHING SCHEME
%
% The user provided hashing function computes two independent
% hashes H1 and H2 for a given key K.
%
% We calculate D = H2 \/ 1 (to ensure that D is non-zero and
% odd and is therefore coprime to the number of buckets) and
% probe the table where the Kth probe examines slot
%
% (H1 + K * H2) `rem` num_buckets
%
% The search is guaranteed to terminate because table occupancy
% must be less than 1.0.
%
% The keys array is an array of maybe(K) and starts off with every slot
% initialised to `no'.
%
% The values array cannot be initialised until at least one entry is
% inserted. This is handled by checking to see whether the values
% array has non-zero size.
%
----------------------------------------------------------------------------
%
% XXX Need to check that N is not too great, either.
%
new(HashPred, N, MaxOccupancy) = HT :-
( if N =< 1 then
throw("hash_table__new_hash_table: N =< 1")
else if MaxOccupancy =< 0.0 ; 1.0 =< MaxOccupancy then
throw("hash_table__new_hash_table: MaxOccupancy not in (0.0,
1.0)")
else
NumBuckets = 1 << N,
MaxOccupants = ceiling_to_int(float(NumBuckets) * MaxOccupancy),
Keys = array__init(NumBuckets, no),
Values = array__make_empty_array,
HT = ht(NumBuckets, 0, MaxOccupants, HashPred, Keys, Values)
).
%
----------------------------------------------------------------------------
%
% These numbers are picked out of thin air.
%
new_default(HashPred) = new(HashPred, 7, 0.9).
%
----------------------------------------------------------------------------
%
% find_slot(HT, K, H, IsEmpty) looks up key K in hash table HT and
% returns the index for the entry K in H and sets IsEmpty to no if
% this is an empty slot and yes if this is the slot for a previously
% inserted value for K.
%
:- pred find_slot(hash_table(K, V), K, int, bool).
:- mode find_slot(hash_table_ui, in, out, out) is det.
find_slot(HT, K, H, IsEmpty) :-
(HT ^ hash_pred)(K, Hash1a, Hash2a),
int__abs(Hash1a, Hash1),
int__abs(Hash2a, Hash2),
H0 = Hash1 `rem` HT ^ num_buckets,
Delta = Hash2 \/ 1, % Have to ensure it's odd and
non-zero.
find_slot_0(HT, K, H0, Delta, H, IsEmpty).
%
----------------------------------------------------------------------------
%
:- pred find_slot_0(hash_table(K, V), K, int, int, int, bool).
:- mode find_slot_0(hash_table_ui, in, in, in, out, out) is det.
find_slot_0(HT, K, H0, Delta, H, IsEmpty) :-
KeySlot = HT ^ keys ^ elem(H0),
(
KeySlot = no,
H = H0,
IsEmpty = yes
;
KeySlot = yes(J),
( if J = K then
H = H0,
IsEmpty = no
else
H1 = (H0 + Delta) `rem` HT ^ num_buckets,
find_slot_0(HT, K, H1, Delta, H, IsEmpty)
)
).
%
----------------------------------------------------------------------------
%
insert(HT0, K, V) = HT :-
% If this is the first entry in the hash table, then we use it to
% set up the hash table (the arrays are currently empty because we
% need values to initialise them with).
%
( if array__size(HT0 ^ values) = 0 then
HT = insert((HT0 ^ values := array__init(HT0 ^ num_buckets, V)), K,
V)
else
find_slot(HT0, K, H, IsEmpty),
( IsEmpty = no,
HT = ( HT0 ^ values ^ elem(H) := V )
;
IsEmpty = yes,
HT =
( if HT0 ^ num_occupants = HT0 ^ max_occupants then
insert(expand(HT0), K, V)
else
((( HT0
^ keys ^ elem(H) := yes(K) )
^ values ^ elem(H) := V )
^ num_occupants := HT0 ^ num_occupants + 1 )
)
)
).
'elem :='(K, HT, V) = insert(HT, K, V).
%
----------------------------------------------------------------------------
%
search(HT, K, V) :-
find_slot(HT, K, H, IsEmpty),
IsEmpty = no,
V = HT ^ values ^ elem(H).
%
----------------------------------------------------------------------------
%
lookup(HT, K) =
( if search(HT, K, V)
then V
else throw("hash_table__lookup: key not found")
).
elem(K, HT) = lookup(HT, K).
%
----------------------------------------------------------------------------
%
delete(HT, K) =
( if find_slot(HT, K, H, yes) then HT ^ keys ^ elem(H) := no else HT ).
%
----------------------------------------------------------------------------
%
to_assoc_list(HT) =
to_assoc_list_0(0, HT ^ num_buckets, HT ^ keys, HT ^ values, []).
%
----------------------------------------------------------------------------
%
:- func to_assoc_list_0(int, int, array(maybe(K)), array(V), assoc_list(K,
V))
= assoc_list(K, V).
:- mode to_assoc_list_0(in, in, array_ui, array_ui, in) = out is det.
to_assoc_list_0(I, NumBuckets, Keys, Values, AList) =
( if I >= NumBuckets then
AList
else if Keys ^ elem(I) = yes(K) then
to_assoc_list_0(I + 1, NumBuckets, Keys, Values,
[K - Values ^ elem(I) | AList])
else
to_assoc_list_0(I + 1, NumBuckets, Keys, Values, AList)
).
%
----------------------------------------------------------------------------
%
% Hash tables expand by doubling in size.
%
:- func expand(hash_table(K, V)) = hash_table(K, V).
:- mode expand(hash_table_di) = hash_table_uo is det.
expand(HT0) = HT :-
HT0 = ht(NBs, _NOs, MOs, HP, Ks0, Vs0),
Ks = array__init(NBs + NBs, no),
Vs = array__make_empty_array,
HT1 = ht(NBs + NBs, 0, MOs + MOs, HP, Ks, Vs),
HT = reinsert_bindings(0, NBs, Ks0, Vs0, HT1).
%
----------------------------------------------------------------------------
%
:- func reinsert_bindings(int, int, array(maybe(K)), array(V), hash_table(K,
V))
= hash_table(K, V).
:- mode reinsert_bindings(in, in, array_ui, array_ui, hash_table_di)
= hash_table_uo is det.
reinsert_bindings(I, NumBuckets, Keys, Values, HT) =
( if I >= NumBuckets then
HT
else if Keys ^ elem(I) = yes(K) then
reinsert_bindings(I + 1, NumBuckets, Keys, Values,
HT ^ elem(K) := Values ^ elem(I))
else
reinsert_bindings(I + 1, NumBuckets, Keys, Values, HT)
).
%
----------------------------------------------------------------------------
%
% There are almost certainly better ones out there...
%
int_hash_pred =
( pred(N::in, H1::out, H2::out) is det :-
H1 = N,
H2 = N `xor` (N + N)
).
%
----------------------------------------------------------------------------
%
% There are almost certainly better ones out there...
%
string_hash_pred =
( pred(S::in, H1::out, H2::out) is det :-
H1 = string__hash(S),
H2 = string__foldl(func(C, N) = char__to_int(C) + N, S, 0)
).
%
----------------------------------------------------------------------------
%
%
----------------------------------------------------------------------------
%
--
Ralph Becket | MSR Cambridge | rbeck at microsoft.com
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to: mercury-developers at cs.mu.oz.au
Administrative Queries: owner-mercury-developers at cs.mu.oz.au
Subscriptions: mercury-developers-request at cs.mu.oz.au
--------------------------------------------------------------------------
More information about the developers
mailing list