Class 12: Industry Hash Functions

So we dealt with the ideas of a hash table, mapping from a key to an index, storing the value at that index, and dealing with collisions. Now the true performance of a hash table is going to be a function of two things: the speed of the hash function and the number of collisions you face.

Today we're going to look at real hash functions which are designed to run fast and avoid collisions. If you start looking online for hash functions you might find cryptographic hash functions which are slower than these but avoid collisions even when a user looks for them. (In the ones I show you a determined person can create many collisions even though they are rare for normal activity.) Note that many DDoS attacks are basically just overloading hash tables at DNS servers.

Murmur Hash

So this is a pretty simple hash which is fast on the CPU, does a good job mixing, and has very few collisions. It multiplies shifts and xors. Let's explore it working:

Hash Test 1: Run Murmur hash 2 on "Andy > Tom Brady". Take note of the value. Then change one letter. Then change the seed by one value. Do the values "feel" random in the range of possible unsigned ints? Now run the original again and make sure it hashes the same way.

Corner Case 2: In the following snippet some bad stuff happens. What goes wrong and why?

The API Task 3: Why does the input take in type const void *? What does the input len do? What is the importance of seed?

FNV

The next hash is really simple. The authors have published sensible starting values for offset and prime. Based on those here is my hand-made implementation:

Tuning Task 4: Take a look at the bits of the FNV_32_prime and FNV_32_offset. What is their product? What is the hash of the one byte 0? What is the hash of the one byte 1?

Mini Challenge 5: Run this hash on the numbers 0 to 50. Save the results into an array of unsigned ints. Find the smallest difference between any two numbers.

Using "Strings"

When hashing the C++ type string you have to lean on an implmentation detail. After C++11 string implementations were required to use contiguous blocks of memory. Before that there were no implementations of string that did NOT use contiguous blocks. The following snippet relies on this to hash a string:

Interpret Task 6: What do you think &*tests.begin() gives?

Own It Task 7: Adapt the snippet to hash your own C++ string.

A Hash Table with Linear Probing

Linear Probing

I'm kinda sorry about that gif…

OK, so let's make a simple SET of strings using a hash table, FNV, and open addressing.

Explore Task 8: Change the return value of myhash to always be 0 (the "Andy hash", spread the word). How does the behavior change?

Delete Task 9: If you leave myhash in inefective mode (returning 0) and change the value of deleted to "" what goes wrong? Why?

Hash Table with Chaining

So finally the singly linked list version. My linked list is probably simpler than what you've written so far, but some of the table logic is funky.

Andy Hash Task 10: Change the hash to output 0 and insert, remove, and check to test this code. (No guarantees.)

In this case I made the choice to keep an array of nodes rather than of pointers to nodes. That made my removal logic a little funky when deleting the first node at a hashed index. I also made the choice of always having an empty node at the end of each chain. This allowed the code to look an awful lot like the linear probe code (other than remove).

Logic Task 11: How does the is_in function work?

Logic Task 12: Why is deleting the first node so funky in this snippet?

Destruct Task 13: Take a look at the Node destructor. How does that function actually delete the whole chain? Why did that add to the headache of removing my nodes in myremove?

Hash Task 14: If you keep the FNV hash strong (as in not the "Andy hash") how does the behavior change?