[m-dev.] Hash table performance

Ralph Becket rafe at cs.mu.OZ.AU
Fri Jan 30 15:56:37 AEDT 2004


I've been doing some performance tests the last couple of days comparing
maps, version hash tables, and unique hash tables.  The results have
been illuminating.

The benchmark involved
- populating a map or hash table with a subset of /usr/share/dict/words,
- searching for every member of /usr/share/dict/words in a map or hash
  table containing a subset, and
- searching for every key in a map or hash table containing a subset.

The unique hash table implementation I wrote a few years back uses
closed addressing: that is, every bucket contains at most one item and
rather than a single hash, a sequence of hashes are used to locate the
right spot.

The version hash table implementation is a copy of the unique hash table
implementation, except with references to arrays and bitmaps replaced
with references to version arrays and version bitmaps.

Surprisingly, 234-trees perform amazingly well even on large amounts of
data (and they do even better if the hash table has to resize itself at
any point.)  Unique hash tables using the above scheme consistently run
15% - 120% faster than maps (they gap increases with the amount of data
being stored).  Version hash tables are actually *slower* when the data
numbers under 2000 to 4000, depending upon the test.

A spot of phorensic profiling shows why: while the hashing functions are
fine (virtually no collisions at about 50% occupancy), every version
hash table read incurrs at least three array accesses and every version
hash table write incurrs at least four array accesses and seven or eight
heap allocations.

The results of the test run are shown in the attachment Results.txt (I
know these runs should go on for longer; take the data with a pinch of
salt.)

Having observed already that maps perform extremely well on small
amounts of data, I knocked up a second version hash table implementation
using open addressing: viz an array of maps, using a single hash to get
to a bucket, then using map manipulation to do the rest.

Performance for this implementation is dramatically better, in some
cases even outperforms (closed addressing) unique hash tables, and runs
50% to 65% faster than maps when the data numbers over 1000 or so.

The lessons of the experience are:
- open addressing can be much better than closed addressing;
- for large enough data (over 100 to 400 items, say), version hash
  tables should be considered in preference to maps;
- I should add the open addressing hash table to the library.

-- Ralph
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to:       mercury-developers at cs.mu.oz.au
Administrative Queries: owner-mercury-developers at cs.mu.oz.au
Subscriptions:          mercury-developers-request at cs.mu.oz.au
--------------------------------------------------------------------------



More information about the developers mailing list