Hash Tables
Section 13.6 in your book.
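Probably how the unordered_map class in C++ is implemented.
Dictionaries in Python are hash tables as well.
Probably not how the map class in C++ is implemented; it keeps its keys in sorted order, which points to a search tree.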
An associative array.
Each element has a (potentially non-integer) key.
Each element has an associated data set.
You would like to be able to quickly access data, like in an array.
Operations (per your book)
MakeDictionary
Insert(key)
Delete(key), though this operation is sometimes left out.
Lookup(key)
They point out that the universal set of keys might be enormous.
Example (from the book): You are receiving news stories.
To keep from duplicating them, you use the first paragraph as a key!
They limit the first paragraph to 1,000 characters.
In the past, I have indexed into a map with a class as the key.
Clearly we cannot allocate an array as large as the set of all possible keys.
So some capacity $c$ is selected.
Hash functions
The idea is to take a key and turn it into an integer.
A simple idea for a string would be $h(s) = \left(\sum_i \mathrm{ord}(s[i])\right) \bmod c$.
This is not the best, but it will map strings to the range [0 ... c).
We would then build an array of the data type of size c.
In C++ we would overload operator[] to take a string and run the hash function.
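The simple string hash described above can be sketched in a few lines. This is only an illustration: the name simple_hash is made up for these notes, and ord(s[i]) is assumed to mean the character's numeric code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: sum the character codes of s, then reduce modulo
// the capacity c. The result always lies in [0, c).
std::size_t simple_hash(const std::string& s, std::size_t c) {
    assert(c > 0);  // a table needs at least one slot
    std::size_t sum = 0;
    for (unsigned char ch : s) {
        sum += ch;  // ord(s[i])
    }
    return sum % c;
}
```

One reason this is not the best: the sum ignores letter order, so anagrams such as "abc" and "cba" always land in the same slot.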
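An example (I stole it from the Levitin book):
Let c = 13.
Let the hash function be the one given above, with ord(s[i]) taken to be the letter's position in the alphabet, ignoring case (a = 1, ..., z = 26).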
A => 1
fool => 9
and => 6
his => 10
money => 7
are => 11
soon => 11
parted => 12
Notice that "are" and "soon" have the same hash value (11).
This is called a collision.
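The hash values in this example come out as listed when each letter is scored by its position in the alphabet (a = 1, ..., z = 26, case ignored) rather than by its character code. A small sketch that reproduces them follows; the names letter_position and letter_hash are hypothetical, and the code assumes an ASCII-style character set where the letters are contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: position of a letter in the alphabet (a = 1, ..., z = 26).
// Case is ignored; anything that is not a letter contributes 0.
std::size_t letter_position(char ch) {
    if (ch >= 'a' && ch <= 'z') return static_cast<std::size_t>(ch - 'a' + 1);
    if (ch >= 'A' && ch <= 'Z') return static_cast<std::size_t>(ch - 'A' + 1);
    return 0;
}

// Sum of letter positions, reduced modulo the capacity c.
std::size_t letter_hash(const std::string& word, std::size_t c) {
    assert(c > 0);
    std::size_t sum = 0;
    for (char ch : word) {
        sum += letter_position(ch);
    }
    return sum % c;
}
```

For instance, "are" sums to 1 + 18 + 5 = 24 and "soon" sums to 19 + 15 + 15 + 14 = 63; both leave a remainder of 11 when divided by 13, which is the collision.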
What can we do about collisions?
Open Hashing (chaining)
Build a bin at each hash position.
On an insert, search the bin for the key.
If it is present replace the data.
If it is not present, add the data in a new location.
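The chaining steps above might look like the sketch below. ChainedTable is a hypothetical class written for these notes, not a standard-library type; it assumes string keys, int data, and the simple sum-of-character-codes hash. It also overloads operator[] to take a string and run the hash function.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class for illustration: open hashing (chaining).
class ChainedTable {
public:
    explicit ChainedTable(std::size_t capacity) : bins_(capacity) {
        assert(capacity > 0);
    }

    // Hash the key, search that bin, and return the data for the key.
    // If the key is not present, add it in a new node with data 0.
    int& operator[](const std::string& key) {
        auto& bin = bins_[hash(key)];
        for (auto& entry : bin) {
            if (entry.first == key) return entry.second;
        }
        bin.emplace_back(key, 0);
        return bin.back().second;
    }

    // Insert: replace the data if the key is present, otherwise add it.
    void insert(const std::string& key, int data) { (*this)[key] = data; }

    // Lookup: pointer to the data, or nullptr if the key is absent.
    const int* lookup(const std::string& key) const {
        for (const auto& entry : bins_[hash(key)]) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

private:
    // Sum of character codes modulo the number of bins.
    std::size_t hash(const std::string& key) const {
        std::size_t sum = 0;
        for (unsigned char ch : key) sum += ch;
        return sum % bins_.size();
    }

    std::vector<std::list<std::pair<std::string, int>>> bins_;
};
```

With this hash, "are" and "era" collide (same letters, same sum), so both end up in one bin and are told apart by comparing keys.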
Closed Hashing (sometimes called Open Addressing)
If the entry is empty, or the key matches, store the data there.
If the entry is full and the key does not match, move to the next entry (wrapping around at the end of the table) and try again.
This is called linear probing; other schemes are possible.
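A minimal sketch of linear probing, under the same assumptions as before (string keys, int data, sum-of-character-codes hash). ProbingTable and its member names are hypothetical. Deletion is left out on purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical class for illustration: closed hashing with linear probing.
class ProbingTable {
public:
    explicit ProbingTable(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    // Returns false when the table is full and the key is not already stored.
    bool insert(const std::string& key, int data) {
        std::size_t i = hash(key);
        for (std::size_t step = 0; step < slots_.size(); ++step) {
            Slot& slot = slots_[i];
            if (!slot.full || slot.key == key) {  // empty, or the key matches
                slot.full = true;
                slot.key = key;
                slot.data = data;
                return true;
            }
            i = (i + 1) % slots_.size();  // full, no match: try the next entry
        }
        return false;
    }

    // Pointer to the data, or nullptr if the key is absent.
    const int* lookup(const std::string& key) const {
        std::size_t i = hash(key);
        for (std::size_t step = 0; step < slots_.size(); ++step) {
            const Slot& slot = slots_[i];
            if (!slot.full) return nullptr;  // an empty entry ends the search
            if (slot.key == key) return &slot.data;
            i = (i + 1) % slots_.size();
        }
        return nullptr;  // probed every entry without finding the key
    }

private:
    struct Slot {
        bool full = false;
        std::string key;
        int data = 0;
    };

    std::size_t hash(const std::string& key) const {
        std::size_t sum = 0;
        for (unsigned char ch : key) sum += ch;
        return sum % slots_.size();
    }

    std::vector<Slot> slots_;
};
```

Note that lookup gives up at the first empty entry. That shortcut is what makes searches fast, and it is also why an entry cannot simply be emptied later.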
Open hashing has the problem that, with a bad hash function, it can turn into a single linked list, i.e. all of the keys map to the same location.
Closed hashing has the problems that
You can't simply delete: a search stops at the first empty entry, so emptying an entry can hide keys stored past it.
You can only hold $c$ unique entries. Once the table is full, it is full.
The load factor in open hashing is the ratio of elements to slots.
If this is low (about 1 or less) and the hash function spreads the keys evenly, each bin holds only about one entry.
Let n be the number of entries, and m be the number of bins (the capacity $c$ above).
The load factor is usually written $\alpha = \frac{n}{m}$.
With a good hash function, a search takes $\Theta(1+\alpha)$ time on average.
This assumes that hashing is $O(1)$.
Note that the worst case is $O(n)$: every key lands in one bin, which is a linked list.
Closed hashing performance becomes worse as $\alpha$ increases.
With open hashing you can double the table size when $\alpha$ becomes too high.
Create a new table of size 2*capacity.
Rehash all entries.
A single rehash is $O(n)$, but amortized over all insertions the cost is $O(1)$ per insertion.
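The doubling step could be sketched as follows. GrowingTable is a hypothetical chained table; the growth threshold ($\alpha$ may not exceed 1) and the default starting capacity of 4 are assumptions chosen for the example, not values from the book.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class for illustration: a chained table that doubles when full.
class GrowingTable {
public:
    explicit GrowingTable(std::size_t capacity = 4) : bins_(capacity), count_(0) {
        assert(capacity > 0);
    }

    void insert(const std::string& key, int data) {
        for (auto& entry : bins_[hash(key, bins_.size())]) {
            if (entry.first == key) { entry.second = data; return; }
        }
        // Assumed policy: grow before the load factor n/m would exceed 1.
        if (count_ + 1 > bins_.size()) grow();
        bins_[hash(key, bins_.size())].emplace_back(key, data);
        ++count_;
    }

    const int* lookup(const std::string& key) const {
        for (const auto& entry : bins_[hash(key, bins_.size())]) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

    std::size_t size() const { return count_; }
    std::size_t capacity() const { return bins_.size(); }

private:
    typedef std::list<std::pair<std::string, int>> Bin;

    static std::size_t hash(const std::string& key, std::size_t m) {
        std::size_t sum = 0;
        for (unsigned char ch : key) sum += ch;
        return sum % m;
    }

    // Create a new table of size 2*capacity and rehash all entries: O(n).
    void grow() {
        std::vector<Bin> bigger(2 * bins_.size());
        for (auto& bin : bins_) {
            for (auto& entry : bin) {
                std::size_t i = hash(entry.first, bigger.size());
                bigger[i].push_back(std::move(entry));
            }
        }
        bins_.swap(bigger);
    }

    std::vector<Bin> bins_;
    std::size_t count_;
};
```

Because the capacity doubles each time, the rehash sizes form a geometric series whose total is less than twice the final number of entries, which is where the amortized $O(1)$ per insertion comes from.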