Don't like this style? Click here to change it! blue.css

Class 11: Hash Tables

"The three most important algorithms at Yahoo! are hashing, hashing, and hashing." -Udi Manber

Today I want to give you the big ideas of hash tables. What you hear today will require you to fill in some gaps before you can really use it but it should be enough to get you on the right track for project 2.

The Big Idea

If we're trying to implement any basic ADT we'll want to insert a key, remove a key, and search by key. In a linked-list or unsorted array looking up a key is \(\mathcal{O}(n)\), a binary search in a sorted array is \(\mathcal{O}(\log{n})\).

But if you are looking up an item in an array by index the cost is \(\mathcal{O}(1)\)!

So a hash table tries to extend the benefits of looking up by index to looking up by any key!

There are two parts to hash tables, the table (array) itself and the hash function. A hash function should be some process of mapping any possible input key (like a website, or a string, or a unsigned long) to a legitimate index (an integer from 0 to n-1).

Graphical Hash

Because the number of possible keys is far larger than the size of the hash table the other half of a hash table is deciding how you'll deal with collisions.

Basic SET using direct hashing

Design Task 1: With your group, design a simple SET class implementing insert(int n), remove(int n), and is_in(int n). It should allow only keys which are non-negative numbers from 0 to 255. It should use \(\mathcal{O}(1)\) for each operation.

Note that we didn't need a hash function for this task.

MultiSet 2: Now how do we extend this to a multi-set on the same universe of keys?

MultiMap 3: How is it different to implement a multi-map in this way? For which operations can we still achieve \(\mathcal{O}(1)\) time? Which ones are different?

A Simple Hash Function

So now we want to use this fast idea to work on larger keys. Let's say we want to implement a SET allowing any unsigned long as a key instead of just 0-255. What problems can we face?

Possible Issue #1: An array of \(2^{64}\) booleans would take up over 18 exabytes! That is 18 million terrabytes. That is a little less than twice all of google's data storage.

If you were to store that much data on punch cards it would look like this (actually 20% higher):

Thanks XKCD

And to think, you were considering using that much space for a CISC 220 in-class mini-task?!? Shame.

Possible Issue #2: If you want to get away with a much smaller array then we still want to allow any unsigned long that someone throws at us. So we need to map the keys into our shorter array.

How can we do this? What if we just toss the keys into the array in consecutive order?

Re-invent the wheel 4: If you were to deal with large keys by using a vector and pushing new keys to the back. What complexity would you get for search?

OK so that didn't work (we've looked at it before, of course). We need a hash function.

First Hash 5: Let's be optimistic for a bit. Create a "hash function" which consumes an unsigned long and returns the value mod 256. Now use an array of 256 unsigned longs to implement inserting into a (/an optimistic) SET. Insert some random integers and then inspect the array.

One-way Task 6: If you knew the value of x % m would you be able to figure out x?

First Hash Pt2 7: Now how do you plan to look up a particular key in \(\mathcal{O}(1)\)? If you wanted to run is_in(unsigned long n) could you predict the index where n would be if it were in the SET?

OK, so we get the idea of this one-way function which maps from a large set of keys to a small set of indices. We can look up a value using this one-way function by just re-hashing what we need to look up.

Next up we need to deal with reality.

Dealing With Collisions

So the last example would work for \(\mathcal{O}(1)\) lookups of our input keys, but would lose information if you happen to insert two keys with the same remainder mod 256.

Whenever two keys are hashed to the same value it's called a collision. There are careers made out of creating collisions for "cryptographic" hash functions if you're interested.

Our first strategy for dealing with collisions is to have the array be an array of nodes, each of which can act as a linked list (this is sometimes called chaining ):

Hash Chaining

Complexity Task 8: What is the best-case cost of is_in implemented using chaining as you currently understand it? What is the worst-case?

Hash Chains 9: Grab one of our old linked lists. Make an array of 256 of them each with a node storing a value of 0. Use our x % 256 hash. When a collision happens add to that linked list. When you need to test if a value is in the SET go to the correct linked list and search for the correct key. Make it so!

Improvements:

Now we can talk about two things: better hash functions and better collision handling.

Open Addressing is used if we don't want to occupy any additional memory when we end up with a collision. In this case if you want to handle a collision you go to the next open spot in the array. If you want to search for a key, hash the key and check the next indices until you find either the key you're after or a 0.

Open Chaining

Walk Through 10: Assuming that the above array was created by open chaining with a hash of key % 10, how many checks would you do to determine if 102 is in your SET? Insert 44, 50, and 72 in that order, where do they go?

To use open addressing you need two default values: one default empty value and another to represent an entry that has been deleted (for instance 0 and -1). The reason is that removed entries might leave a gap. Imagine deleting 34 from the above image then searching for 52!

Better Hashes: there are many great hashes out there. Cryptographic hashes make it so that an adversary has no information about the key that was hashed. Collision-resistant hashes for search engines don't care about that, but do want almost no collisions. Google has a hash called CityHash (details here ) which is very fast and collision resistant (it's open source which means you can study it, use it, and network with their engineers about it).

A simple string hash: A nice text-book string hash is: \( \sum_i \alpha^i ord(S[i]) \mod{m} \) which treats the string like a polynomial in \(\alpha \) (or a base \(\alpha\) integer).

A cool string hash in practice is the FNV hash.