[m-dev.] For review: hash table implementation

Ralph Becket rbeck at microsoft.com
Thu Feb 1 03:18:53 AEDT 2001


This has been missing for too long.  It should be considerably faster
and more economical than maps in many cases (each update consumes only
6 or 7 words of storage, and only 1 if structure reuse works for it).
Of course, against that you have the fact that you're using extra space
up, but in practice occupancy should be between 45 - 90% which is on a
par with 234-trees.

Downsides: arguably more expensive for mapping and folding (I have yet
to supply these functions); weird modes are required; you have to supply
hashing predicates for each key type (anyone got a generic hash
function or two handy?)

Ralph



Estimated hours taken: 6

Added a hash table implementation to the library.

library/hash_table.m:
	Created.


Full source:
%
----------------------------------------------------------------------------
%
% hash_table.m
% Ralph Becket <rbeck at microsoft.com>
% Tue Jan 30 15:21:45 GMT 2001
% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
%
% Hash table implementation.
%
% This implementation uses double hashing and requires the user to
% supply a predicate that will compute two independent hash values
% for any given key.
%
% The number of buckets in the hash table is always a power of 2.
%
% When a user set occupancy level is achieved, the number of buckets
% in the table is doubled and the previous contents reinserted into
% the new hash table.
%
%
----------------------------------------------------------------------------
%

:- module hash_table.

:- interface.

:- import_module int, assoc_list, float, string, array.



:- type hash_table(K, V).

    % XXX This is all fake until the compiler can handle nested unique
modes.
    %
:- inst hash_table ==
            bound(ht(ground, ground, ground, hash_pred, array, array)).
:- mode hash_table_ui == in(hash_table).
:- mode hash_table_di == di(hash_table).
:- mode hash_table_uo == out(hash_table).

:- type hash_pred(K) == ( pred(K, int, int) ).
:- inst hash_pred    == ( pred(in, out, out) is det ).

    % new(HashPred, N, MaxOccupancy)
    % constructs a new hash table with initial size 2 ^ N that is
    % doubled whenever MaxOccupancy is achieved; elements are
    % indexed using HashPred.
    %
    % N must be greater than 1.
    % MaxOccupancy must be in (0.0, 1.0).
    %
:- func new(hash_pred(K), int, float) = hash_table(K, V).
:- mode new(in(hash_pred), in, in) = hash_table_uo is det.

    % new_default(HashFn) constructs a hash table with default size and
    % occupancy arguments.
    %
:- func new_default(hash_pred(K)) = hash_table(K, V).
:- mode new_default(in(hash_pred)) = hash_table_uo is det.

    % Retrieve the hash_pred associated with a hash table.
    %
:- func hash_pred(hash_table(K, V)) = hash_pred(K).
:- mode hash_pred(hash_table_ui) = out(hash_pred) is det.

    % Default hash_preds for ints and strings.
    %
:- func int_hash_pred = hash_pred(int).
:- mode int_hash_pred = out(hash_pred) is det.

:- func string_hash_pred = hash_pred(string).
:- mode string_hash_pred = out(hash_pred) is det.

    % Returns the number of buckets in a hash table.
    %
:- func num_buckets(hash_table(K, V)) = int.
:- mode num_buckets(hash_table_ui) = out is det.

    % Returns the number of occupants in a hash table.
    %
:- func num_occupants(hash_table(K, V)) = int.
:- mode num_occupants(hash_table_ui) = out is det.

    % Insert an item into a hash table; if one is already there then
    % it is overwritten.
    %
:- func insert(hash_table(K, V), K, V) = hash_table(K, V).
:- mode insert(hash_table_di, in, in) = hash_table_uo is det.

:- func 'elem :='(K, hash_table(K, V), V) = hash_table(K, V).
:- mode 'elem :='(in, hash_table_di, in) = hash_table_uo is det.

    % Delete the entry (if any) for the given key.
    %
:- func delete(hash_table(K, V), K) = hash_table(K, V).
:- mode delete(hash_table_di, in) = hash_table_uo is det.

    % Lookup the value associated with the given key.  An exception
    % is raised if there is no entry for the key.
    %
:- func lookup(hash_table(K, V), K) = V.
:- mode lookup(hash_table_ui, in) = out is det.

:- func elem(K, hash_table(K, V)) = V.
:- mode elem(in, hash_table_ui) = out is det.

    % Like lookup, but just fails if there is no entry for the key.
    %
:- pred search(hash_table(K, V), K, V).
:- mode search(hash_table_ui, in, out) is semidet.

    % Convert a hash table into an association list.
    %
:- func to_assoc_list(hash_table(K, V)) = assoc_list(K, V).
:- mode to_assoc_list(hash_table_ui) = out is det.

%
----------------------------------------------------------------------------
%
%
----------------------------------------------------------------------------
%

:- implementation.

:- import_module exception, math, std_util, bool, exception, char, list.



:- type hash_table(K, V) 
    --->    ht(
                num_buckets             :: int,
                num_occupants           :: int,
                max_occupants           :: int,
                hash_pred               :: hash_pred(K),
                keys                    :: array(maybe(K)),
                values                  :: array(V)
            ).

%
----------------------------------------------------------------------------
%

    % THE HASHING SCHEME
    %
    % The user provided hashing function computes two independent
    % hashes H1 and H2 for a given key K.
    %
    % We calculate D = H2 \/ 1 (to ensure that D is non-zero and
    % odd and is therefore coprime to the number of buckets) and
    % probe the table where the Kth probe examines slot
    %
    %   (H1 + K * H2) `rem` num_buckets
    %
    % The search is guaranteed to terminate because table occupancy
    % must be less than 1.0.
    %
    % The keys array is an array of maybe(K) and starts off with every slot
    % initialised to `no'.
    %
    % The values array cannot be initialised until at least one entry is
    % inserted.  This is handled by checking to see whether the values
    % array has non-zero size.

%
----------------------------------------------------------------------------
%

    % XXX Need to check that N is not too great, either.
    %
new(HashPred, N, MaxOccupancy) = HT :-
    (      if N =< 1 then
            throw("hash_table__new_hash_table: N =< 1")
      else if MaxOccupancy =< 0.0 ; 1.0 =< MaxOccupancy  then
            throw("hash_table__new_hash_table: MaxOccupancy not in (0.0,
1.0)")
      else
            NumBuckets   = 1 << N,
            MaxOccupants = ceiling_to_int(float(NumBuckets) * MaxOccupancy),
            Keys         = array__init(NumBuckets, no),
            Values       = array__make_empty_array,
            HT = ht(NumBuckets, 0, MaxOccupants, HashPred, Keys, Values)
    ).

%
----------------------------------------------------------------------------
%

    % These numbers are picked out of thin air.
    %
new_default(HashPred) = new(HashPred, 7, 0.9).

%
----------------------------------------------------------------------------
%

    % find_slot(HT, K, H, IsEmpty) looks up key K in hash table HT and
    % returns the index for the entry K in H and sets IsEmpty to no if
    % this is an empty slot and yes if this is the slot for a previously
    % inserted value for K.
    %
:- pred find_slot(hash_table(K, V), K, int, bool).
:- mode find_slot(hash_table_ui, in, out, out) is det.

find_slot(HT, K, H, IsEmpty) :-
    (HT ^ hash_pred)(K, Hash1a, Hash2a),
    int__abs(Hash1a, Hash1),
    int__abs(Hash2a, Hash2),
    H0    = Hash1 `rem` HT ^ num_buckets,
    Delta = Hash2 \/ 1,                 % Have to ensure it's odd and
non-zero.
    find_slot_0(HT, K, H0, Delta, H, IsEmpty).

%
----------------------------------------------------------------------------
%

:- pred find_slot_0(hash_table(K, V), K, int, int, int, bool).
:- mode find_slot_0(hash_table_ui, in, in, in, out, out) is det.

find_slot_0(HT, K, H0, Delta, H, IsEmpty) :-
    KeySlot = HT ^ keys ^ elem(H0),
    (
        KeySlot = no,
        H       = H0,
        IsEmpty = yes
    ;
        KeySlot = yes(J),
        ( if J = K then
            H = H0,
            IsEmpty = no
          else
            H1 = (H0 + Delta) `rem` HT ^ num_buckets,
            find_slot_0(HT, K, H1, Delta, H, IsEmpty)
        )
    ).

%
----------------------------------------------------------------------------
%

insert(HT0, K, V) = HT :-

        % If this is the first entry in the hash table, then we use it to
        % set up the hash table (the arrays are currently empty because we
        % need values to initialise them with).
        %
    ( if array__size(HT0 ^ values) = 0 then
        HT = insert((HT0 ^ values := array__init(HT0 ^ num_buckets, V)), K,
V)
      else
        find_slot(HT0, K, H, IsEmpty),
        (   IsEmpty = no,
            HT = ( HT0 ^ values ^ elem(H) := V )
        ;
            IsEmpty = yes,
            HT =
                ( if HT0 ^ num_occupants = HT0 ^ max_occupants then
                    insert(expand(HT0), K, V)
                  else
                    ((( HT0
                            ^ keys ^ elem(H)   := yes(K) )
                            ^ values ^ elem(H) := V )
                            ^ num_occupants    := HT0 ^ num_occupants + 1 )
                )
        )
    ).

'elem :='(K, HT, V) = insert(HT, K, V).

%
----------------------------------------------------------------------------
%

search(HT, K, V) :-
    find_slot(HT, K, H, IsEmpty),
    IsEmpty = no,
    V = HT ^ values ^ elem(H).

%
----------------------------------------------------------------------------
%

lookup(HT, K) =
    ( if search(HT, K, V)
      then V
      else throw("hash_table__lookup: key not found")
    ).

elem(K, HT) = lookup(HT, K).

%
----------------------------------------------------------------------------
%

delete(HT, K) =
    ( if find_slot(HT, K, H, yes) then HT ^ keys ^ elem(H) := no else HT ).

%
----------------------------------------------------------------------------
%

to_assoc_list(HT) =
    to_assoc_list_0(0, HT ^ num_buckets, HT ^ keys, HT ^ values, []).

%
----------------------------------------------------------------------------
%

:- func to_assoc_list_0(int, int, array(maybe(K)), array(V), assoc_list(K,
V))
            = assoc_list(K, V).
:- mode to_assoc_list_0(in, in, array_ui, array_ui, in) = out is det.

to_assoc_list_0(I, NumBuckets, Keys, Values, AList) =
    ( if I >= NumBuckets then
        AList
      else if Keys ^ elem(I) = yes(K) then
        to_assoc_list_0(I + 1, NumBuckets, Keys, Values,
                [K - Values ^ elem(I) | AList])
      else
        to_assoc_list_0(I + 1, NumBuckets, Keys, Values, AList)
    ).

%
----------------------------------------------------------------------------
%

    % Hash tables expand by doubling in size.
    %
:- func expand(hash_table(K, V)) = hash_table(K, V).
:- mode expand(hash_table_di) = hash_table_uo is det.

expand(HT0) = HT :-
    HT0 = ht(NBs, _NOs, MOs, HP, Ks0, Vs0),
    Ks  = array__init(NBs + NBs, no),
    Vs  = array__make_empty_array,
    HT1 = ht(NBs + NBs, 0, MOs + MOs, HP, Ks, Vs),
    HT  = reinsert_bindings(0, NBs, Ks0, Vs0, HT1).

%
----------------------------------------------------------------------------
%

:- func reinsert_bindings(int, int, array(maybe(K)), array(V), hash_table(K,
V))
            = hash_table(K, V).
:- mode reinsert_bindings(in, in, array_ui, array_ui, hash_table_di)
            = hash_table_uo is det.

reinsert_bindings(I, NumBuckets, Keys, Values, HT) =
    ( if I >= NumBuckets then
        HT
      else if Keys ^ elem(I) = yes(K) then
        reinsert_bindings(I + 1, NumBuckets, Keys, Values,
                HT ^ elem(K) := Values ^ elem(I))
      else
        reinsert_bindings(I + 1, NumBuckets, Keys, Values, HT)
    ).

%
----------------------------------------------------------------------------
%

    % There are almost certainly better ones out there...
    %
int_hash_pred =
    ( pred(N::in, H1::out, H2::out) is det :-
        H1 = N,
        H2 = N `xor` (N + N)
    ).

%
----------------------------------------------------------------------------
%

    % There are almost certainly better ones out there...
    %
string_hash_pred =
    ( pred(S::in, H1::out, H2::out) is det :-
        H1 = string__hash(S),
        H2 = string__foldl(func(C, N) = char__to_int(C) + N, S, 0)
    ).

%
----------------------------------------------------------------------------
%
%
----------------------------------------------------------------------------
%

--
Ralph Becket      |      MSR Cambridge      |      rbeck at microsoft.com 
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to:       mercury-developers at cs.mu.oz.au
Administrative Queries: owner-mercury-developers at cs.mu.oz.au
Subscriptions:          mercury-developers-request at cs.mu.oz.au
--------------------------------------------------------------------------



More information about the developers mailing list