SSTable and Bloom Filter
Today I learned about a technique called SSTable. SSTable is a Sorted String Table
which is a file containing a table of string in a form of key:value
sorted by a
key. It allows for very quick data read. It is often used by databases like
Cassandra or Google's BigTable. SSTable is immutable so it's read-only by design.
How do we perform writes then? Most often we have an additional memory based data
structure which performs writes (it's relatively quick because it's in RAM and not
on a disk). When it's full or some threshold is met, we flush the memory based
data structure into a SSTable on disk.
Next thing is Bloom Filter. Bloom Filter is a data structure which allows to quickly check if something is in our data set. For example, we are implementing user registration system, and we want to check if specified username is already taken or not. The idea is to have multiple hashing functions and a bit vector. When we are adding a new data to our system, then we are passing it via first hash function, and we are getting an index in the vector. We are setting this index to 1. Then we are passing our data to second hash function (different one) which will give us (probably) different index. We are setting this index to 1 again. We do the same operation for the rest of the hash functions. Next, when a query arrives, we need to pass the value through the same hash functions and check if the bits in the vector are set. If at least one bit is not set, then we can say, with 100% confidence, that this is new data, which we don't have in our data set yet. On the other hand, when all bits are set in our vector, then we can say that it's possible that this data is in our set, but we can't be sure (because of the nature of the hashing functions and collisions which are unavoidable). Such data structure is called probabilistic data structure. It's very useful technique, because bloom filter is very space and CPU efficient.