**Introduction**

In this blog post, I shall describe about Bloom filter. Bloom filter is a probabilistic data structure which trades off accuracy over space efficiency. The key operation Bloom filter provides for is membership query, that is, test whether an element is present in a given set or not. The time complexity for the membership query provided by Bloom filter is independent of the number of elements in the test and the amount of memory used by Bloom Filter to support membership query is quite small compared to amount of memory needed for storing the whole set. The membership query can return true when the element is not present in the Bloom filter. But when it returns false, then the element is is for sure not contained in the Bloom filter. Thus Bloom filter can report false positives but never false negatives. The false positives explains the probabilistic nature of the data structure.

**Algorithm and A Toy Implementation of Bloom Filter in Haskell**

The bloom filter consists of bit array of size m (b

_{1}, b_{2}, ..., b_{m}). Initially the bits of this array are all set to 0's. k independent hash functions (h_{1}(e), h_{2}(e), ..., h_{k}(e)) are defined which map to the range {1,..., m}. The key is hashed with these k hash functions. Let's look at the algorithm of the two operations that is done using the Bloom Filter.*Algorithm for 'Insertion' of Key e in Bloom Filter*

for i = 1 to k

set b[h[i]] to 1

The time complexity for insertion is proportional to k, that is, O(k) assuming hashing takes constant time and bit set operation is also constant.

*Algorithm for Membership query*

bool res=true

for i = 1 to k

if b[h[i]] == 0

res= false

break out of loop

return res

*Toy Haskell Implementation*

Below is a toy implementation of Bloom filter and it's operation in Haskell programming language. In the implementation

*,*we choose
m = 32

k = 3

The hash functions used are (x

^{2}+ x^{3})*i)%m.
where i range from [1,2,3]

Using the above, for example to insert 2010 into Bloom filter, we pass 2010 to each of the hash function which results in the values 12, 24, 4. Each of this bit position in the bit array shall be set to 1. Likewise we insert elements 2013, 2007 and 2004.An example of false positive occurs when we query for the number 3200. 3200 when hashed against each of the hash results in values 0,0,0. Bit 0 slot is already set by insertion of 2004 in the Bloom filter. So this results in a false positive.

By now, with the above implmentation it becomes clear that the number and independence of the hash functions, bit array size and the total size of the universe set shall play an important role in determing the false positive rate as well the space efficiency of the Bloom filter. In the next section we determine mathematically the probability of false positive rate and the optimal number of hash functions.

```
import Data.Bits
import Data.Word
import Data.List
type Bitarray32 = Word32
barr :: Word32
barr = 0
m = 32
k = 3
hash_fn x i = ((x^2+x^3)*i) `mod` m
insertBloom barr' k x = let
hfn = hash_fn x
bitslots= map (hfn) [1..k]
in
foldl' (\bx bs -> bx `setBit` (bs)) barr' bitslots
membQueryBloom barr' k x = let
hfn = hash_fn x
bitSlots= map (hfn) [1..k]
in
foldl' (\qt bs -> qt && (barr' `testBit` (bs))) True bitSlots
toBsl barr' = foldl' (\bsl bs -> if testBit barr' (bs) then bs:bsl else bsl) [] $ reverse [0..31]
main = do
let xs = [2010,2013,2007,2004]
putStrLn $ "Inserting elements " ++ (intercalate ", " $ map (show) xs) ++ " in Bloom filter."
let barr'' = foldl' (\bl x -> insertBloom bl k x) barr xs
putStrLn $ "Membership query for element 2010 in Bloom filter:" ++ (show $ membQueryBloom barr'' k 2010)
putStrLn $ "Membership query for element 2007 in Bloom filter:" ++ (show $ membQueryBloom barr'' k 2007)
putStrLn $ "Membership query for element 3200 in Bloom filter:(FALSE POSITIVE)" ++ (show $ membQueryBloom barr'' k 3200)
putStrLn $ "Membership query for element 2009 in Bloom filter:" ++ (show $ membQueryBloom barr'' k 2009)
putStrLn $ "Bit Slot positions in the bit arraty :" ++ (show $ toBsl barr'')
```

Output from the above program:

**thayumanavar@thayumanavar-ThinkPad-T420:~/VWB/thpv$runghc bloom1.hs**

Inserting elements 2010, 2013, 2007, 2004 in Bloom filter.

Membership query for element 2010 in Bloom filter:True

Membership query for element 2007 in Bloom filter:True

Membership query for element 3200 in Bloom filter:(FALSE POSITIVE)True

Membership query for element 2009 in Bloom filter:False

Bit Slot positions in the bit arraty :[0,4,8,10,12,14,16,24,28]

Inserting elements 2010, 2013, 2007, 2004 in Bloom filter.

Membership query for element 2010 in Bloom filter:True

Membership query for element 2007 in Bloom filter:True

Membership query for element 3200 in Bloom filter:(FALSE POSITIVE)True

Membership query for element 2009 in Bloom filter:False

Bit Slot positions in the bit arraty :[0,4,8,10,12,14,16,24,28]

*Caveats*

A natural question that arises is if I decide to go ahead and use Bloom filter in the design of a software, how do I know how much false positive it will result and how do I reduce the false positive rate to a minimum?. The key parameters that play a role in determining the false positive rate are n, m and k. But one usually doesn't have much control over n and m and they are usually over eclipsed by application and system constraints. The software designer makes a choice on number of hash functions k and these k hash functions should be fairly independent.

Maximum information density happens when the bit array is half full and based on this, the optimal number of hash function can shown to be:

k= (m /
n)*ln 2. The probability p of false positive rate is given by the following formula:

p=(1-e

^{-kn / m})^{k}.**Bloom Filter Example Usage in Cassandra**

Now let's see a scenario where Bloom filter is being used in a large-scale distributed software. Bloom filter could be used in situations where there is a large volume of data set lying on disk or behind a network service. Clearly the latency to go to underlying disk or network service and check whether data is present and fetch the data shall be quite high. At the same time, building a cache to fetch recently used data shall be expensive in terms of memory & load will be prohibitively expensive. In such situations, taking into account the false positive rate, one can opt to use Bloom filter.

Cassandra uses Bloom filter in it's read path to reduce latency of read queries. Each SSTable has a row key bloom filter which quickly checks to see if a given row is there in SSTable, thereby avoiding unnecessary expensive disk seeks. In addition, each row has column names Bloom filter to quickly whether the row contains the columns the query is looking for.

In fact Bloom filter finds it's application in softwares that runs on handheld smart phones, browser, system software, databases and all the way to distributed software that runs on the cloud. It is to be noted that Bloom filter was invented at MIT by Bloom way back in 1970s where memory was considered a scarce resource.

**Conclusion**

**In this blog post, we have looked at Bloom filter. Bloom filter is a novel data structure that is probabilistic in nature and there are a lot of design factors that one needs to take into account while making use of this data structure in a production software. In particular we have looked at how Bloom filter is used in Cassandra, a distributed database. Also we have seen a toy implementation of Bloom filter in Haskell which illustrates its working.**

My main motivation in writing this post was to study the various primitives and data structures that underlie current distributed systems as well the high-level architecture of various distributed software.

## No comments :

Post a Comment