Hash Tables

13.6 in your book.
Implemented as hash tables
- The unordered_map class in C++
- Dictionaries in Python
Probably not
- The map class in C++, which is usually a balanced search tree.
An associative array
- Each element has a potentially non-integer key.
- Each element has an associated data set.
You would like to be able to quickly access data, like in an array.
Operations (per your book)
- MakeDictionary
- Insert(key)
- Delete(key), though this operation might be omitted.
- Lookup(key)
They point out that the universal set of keys might be enormous.
- Example (from the book): You are receiving news stories.
  - To keep from duplicating them, you use the first paragraph as a key!
  - They limit the first paragraph to 1,000 characters.
- In the past, I have indexed into a map with a class as the key.
It is clear that we cannot allocate an array as large as the universal set of keys.
- So some capacity $c$ is selected.
Hash functions
- The idea is to take a key and turn it into an integer
- A simple idea for a string would be $h(s) = \left(\sum_i \mathrm{ord}(s[i])\right) \bmod c$
  - This is not the best: strings with the same letters in a different order always collide.
  - But it will map strings to the range [0 ... c).
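As a minimal sketch of the simple hash function above, assuming ord means the character code (as Python's built-in ord does). The name simple_hash and the sample words are made up for illustration.

```python
def simple_hash(s: str, c: int) -> int:
    """Sum the character codes of s and reduce modulo the capacity c."""
    return sum(ord(ch) for ch in s) % c


c = 13  # capacity, chosen arbitrarily for illustration
for word in ["cat", "act", "dog"]:
    print(word, simple_hash(word, c))
```

Because addition ignores order, "cat" and "act" always land in the same slot, which is one reason this function is not the best.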
We would then build an array of the data type of size c.
- In C++ we would overload operator[] to take a string and run the hash function.
An example: (I stole from the Levitin book)
- Let c = 13.
- Let the hash function be the one given above, with ord(s[i]) taken as the letter's position in the alphabet (a = 1, ..., z = 26, ignoring case).
```
     A => 1
  fool => 9
   and => 6
   his => 10
 money => 7
   are => 11
  soon => 11
parted => 12
```
- Notice that "are" and "soon" have the same hash value, 11.
- This is called a collision.
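The values in the table are reproduced if each letter is scored by its position in the alphabet (a = 1, ..., z = 26) rather than by its character code. That reading is an assumption here. A short sketch that recomputes the table and reports the collision (letter_hash is a name made up for illustration):

```python
def letter_hash(word: str, c: int = 13) -> int:
    """Sum alphabet positions (a=1 ... z=26, case ignored) modulo c."""
    return sum(ord(ch) - ord("a") + 1 for ch in word.lower()) % c


words = ["A", "fool", "and", "his", "money", "are", "soon", "parted"]
slots = {}  # hash value -> words that landed there
for w in words:
    slots.setdefault(letter_hash(w), []).append(w)

# Keep only the hash values shared by more than one word.
collisions = {h: ws for h, ws in slots.items() if len(ws) > 1}
print(collisions)  # {11: ['are', 'soon']}
```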
What can we do about collisions?
- Open Hashing (chaining)
  - Build a bin at each hash position.
  - Search the bin for the key
    - If it is present replace the data.
    - If it is not present, add the data in a new location.
- Closed Hashing (sometimes called Open Addressing)
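  - Hash the key to find its starting entry.
  - If the entry is empty, or the key matches, store the data there.
  - If the entry is full and the key does not match, move to the next entry (wrapping around at the end) and check again.
  - Stepping one entry at a time is called linear probing; other schemes are possible.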
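A minimal sketch of closed hashing with linear probing, assuming Python's built-in hash as the hash function and no support for deletion. The class and method names are made up for illustration.

```python
class LinearProbingTable:
    """Closed hashing (open addressing) with linear probing; no deletion."""

    def __init__(self, capacity: int = 13):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot is None or a (key, value) pair

    def _probe(self, key):
        """Yield slot indices starting at the home slot, wrapping around once."""
        start = hash(key) % self.capacity
        for step in range(self.capacity):
            yield (start + step) % self.capacity

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)  # empty slot or same key: store here
                return
        raise OverflowError("table is full")

    def lookup(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                break  # an empty slot ends the search
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)


table = LinearProbingTable(capacity=5)
table.insert(1, "a")   # home slot 1
table.insert(6, "b")   # home slot 1 is taken, so it moves to slot 2
table.insert(11, "c")  # slots 1 and 2 are taken, so it moves to slot 3
print(table.slots)
```

Small integer keys are used in the demo because Python hashes them to themselves, which makes the probing easy to follow by hand.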
Open hashing has the problem that
- With a bad hash function, this can turn into a linked list.
- I.e., all of the keys map to the same location.
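A sketch of open hashing (chaining) that makes the bad-hash problem visible. It assumes each bin is a Python list and lets the hash function be swapped out; the class name and the constant "bad" hash are made up for illustration.

```python
class ChainedTable:
    """Open hashing: each slot holds a list (bin) of (key, value) pairs."""

    def __init__(self, capacity=13, hash_fn=hash):
        self.capacity = capacity
        self.hash_fn = hash_fn
        self.bins = [[] for _ in range(capacity)]

    def insert(self, key, value):
        bin_ = self.bins[self.hash_fn(key) % self.capacity]
        for i, (k, _) in enumerate(bin_):
            if k == key:
                bin_[i] = (key, value)  # key present: replace the data
                return
        bin_.append((key, value))  # key absent: add in a new location

    def lookup(self, key):
        for k, v in self.bins[self.hash_fn(key) % self.capacity]:
            if k == key:
                return v
        raise KeyError(key)


good = ChainedTable(capacity=13)
bad = ChainedTable(capacity=13, hash_fn=lambda key: 0)  # every key goes to bin 0
for n in range(26):
    good.insert(n, str(n))
    bad.insert(n, str(n))

print(max(len(b) for b in good.bins))  # 2: the keys spread evenly
print(max(len(b) for b in bad.bins))   # 26: one bin holds everything
```

Both tables return the right answers, but every lookup in the second one walks a single list of all 26 entries.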
Closed hashing has the problems that
- You can't simply delete: a search stops at the first empty slot, so emptying a slot can hide keys that were probed past it.
- You can only hold $c$ unique entries. Once the table is full, it is full.
The load factor in open hashing
- Let n be the number of entries, and m be the number of bins.
- The load factor is the ratio of entries to bins, usually written $\alpha = \frac{n}{m}$.
- If $\alpha$ is low (about 1) and the hash function spreads keys evenly, each bin holds only about one entry.
- With a good hash function, a search takes $\Theta(1+\alpha)$ time on average.
- This assumes that computing the hash is $O(1)$.
- Note, the worst case would be $O(n)$, when every key lands in one bin and the table is a linked list.
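To make the load factor concrete, here is a small sketch that drops n pseudo-random integer keys into m bins and compares $\alpha$ with the bin lengths. The key range, the seed, and the use of key % m as the hash are all assumptions chosen for illustration.

```python
import random


def bin_lengths(keys, m):
    """Distribute integer keys over m bins by key % m and return each bin's length."""
    lengths = [0] * m
    for k in keys:
        lengths[k % m] += 1
    return lengths


random.seed(0)  # fixed seed so the run is repeatable
n, m = 1000, 500
keys = random.sample(range(10**9), n)  # n distinct pseudo-random keys
lengths = bin_lengths(keys, m)

alpha = n / m
average_bin = sum(lengths) / m  # equals alpha by construction
print(alpha, average_bin, max(lengths))
```

The average bin length is exactly $\alpha$, which is where the $\alpha$ term in $\Theta(1+\alpha)$ comes from. The longest bin is typically somewhat larger than the average.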
Closed hashing performance becomes worse as $\alpha$ increases, and degrades sharply as $\alpha$ approaches 1.
With open hashing you can double the table size when $\alpha$ becomes too high.
- Create a new table of size 2*capacity.
- Rehash all entries.
- This is $O(n)$, but amortized over all insertions the cost is $O(1)$ per insertion.
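The doubling steps above can be sketched as follows. The threshold of 1.0, the starting capacity of 4, and the counter that tracks how many entries get rehashed are assumptions added for illustration.

```python
class GrowingTable:
    """Chained table that doubles its capacity when the load factor exceeds a limit."""

    def __init__(self, capacity=4, max_load=1.0):
        self.capacity = capacity
        self.max_load = max_load
        self.size = 0
        self.rehashed = 0  # total entries moved during all resizes
        self.bins = [[] for _ in range(capacity)]

    def insert(self, key, value):
        bin_ = self.bins[hash(key) % self.capacity]
        for i, (k, _) in enumerate(bin_):
            if k == key:
                bin_[i] = (key, value)
                return
        bin_.append((key, value))
        self.size += 1
        if self.size / self.capacity > self.max_load:
            self._grow()

    def _grow(self):
        old_entries = [pair for bin_ in self.bins for pair in bin_]
        self.capacity *= 2  # create a new table of size 2*capacity
        self.bins = [[] for _ in range(self.capacity)]
        for key, value in old_entries:  # rehash every entry: O(n)
            self.bins[hash(key) % self.capacity].append((key, value))
        self.rehashed += len(old_entries)


table = GrowingTable()
n = 1000
for k in range(n):
    table.insert(k, k)
print(table.capacity, table.rehashed, table.rehashed / n)
```

Each resize moves every entry, but resizes become rarer as the table grows, so the total number of moves stays below 2n. That is the sense in which the cost per insertion is amortized $O(1)$.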