Don't like this style? Click here to change it! blue.css

Class 13: Bloom Filters and Hash Play

So today I want to do yet another wild experiment in software engineering. Build a spell-checker!

You're going to play with a probabilistic data structure called a Bloom filter (named after Burton H. Bloom).

Today's Objectives (this is HW3 BTW)

You will build a spell checker using a bloom filter. You should insert every word in the dictionary into your bloom filter. You should then test words in the dictionary and words not in the dictionary. This data structure is fast and memory efficient but it can give false positives (says a word is in the dictionary when it really isn't). Once you have a basic version working tweak the parameters so that the false positives are rare.

A dictionary can be found at (BUT DON'T OPEN IT JUST wget the URL) https://raw.githubusercontent.com/cisc220/bloom-filter/master/wrds.txt

The way a bloom filter works is like this: Hash each word with several distinct hash functions (Murmur with several distinct seeds). Store a 1 at each of those indices in an array of bits (if you use bytes that's ok too but not as memory efficient). When you go to check if a word has been previously inserted then check the same hash functions on the new word, if each of those indices has a 1 then say it is in there otherwise 0.

Learn by example at http://billmill.org/bloomfilter-tutorial/

Think it out: can the bloom filter tell you that a word is not in the dictionary when it actually has been inserted before?

Why?

This is a data structure which is all about speed. It is so fast that in some situations it can't even be trusted! Many companies and technologies quietly lean on bloom filters to reject bad queries before they are sent to more expensive processing. Bitcoin, Google, Apache, and other distributed information retrieval systems use Bloom filters as, at least, a first pass.

In my career I've been faced with the speed/certainty trade-off many times, so I think it's healthy to feel that out.

The next benefit is that this project is a direct follow-up to Hash Table Tuesday. You should use one of the hash functions that let's you make a family of different results. Try Murmur Hash with several seeds.

How?

Here is the big idea:

Thought experiment: if all values were 1 then you would always say that every key is in your bloom filter.

What?

Here is a code Kata which outlines your objective: http://codekata.com/kata/kata05-bloom-filters/. It will provide you with a word list. Your team should build a spell checker.

Here is a statistical analysis of the false-positive rates for various numbers of hashes and array sizes,

These problems act much like project Euler problems, in that they are great ways to build your coding skills by intentional practice. However they aren't as heavily number theory based.

Groups which turn this in with their homework will get some extra credit on that assignment.

To help you, here is a version I wrote for you. Mine uses Multiplicative Hashing. I would like you to try completely different hash functions and just use mine whenever you get stuck or need a little extra insight. Mine also uses a binary file to store the bloom filter, partly for demonstration of the sizes involved.

You'll have to work in Cloud9 for this, because we're dealing with some file input/output for the first time really.

Hash Functions

Murmur Hash: https://sites.google.com/site/murmurhash/

Comparison of many: http://burtleburtle.net/bob/hash/doobs.html including 14 hash functions side by side.

Wikipedia List: https://en.wikipedia.org/wiki/List_of_hash_functions