Failure detection
In a distributed system, we need to react in case one of our nodes are down. For example, we may need to redirect requests to another server or handle data writes temporarily in another database. To be able to do that, we need a way to detect node failure.
Today I want to describe a method called Gossip Protocol. The algorithm is as follows:
- Each node maintains the list of other nodes and recent heartbeat counter of those nodes
- Each node periodically increases its own heartbeat
- Every node periodically sends the list of heartbeats to random nodes.
- When heartbeats is delivered to node, then it updates its own list and sends the heartbeats to other nodes.
- When node heartbeat is not increased for a while then it's considered down. For
example: some node
A
notices that nodeD
counter has not increased for a while. It sends its heartbeats to other nodes. When he got the response from other nodes, and he sees, that nodeD
heartbeat didn't change then it considers nodeD
as offline and sends this information to other nodes.
Now, when we already know that one of our nodes is down, we need to react to that.
In strict quorum systems, we simply block writes until the node is operational
again. This hurts availability. There is a technique called sloppy quorum in which
we select first W
healthy servers for writes and first R
healthy servers for
reads, and we ignore the unhealthy ones. Then, the selected servers handle the
traffic for the broken one, and when it will get healthy, then servers will hand
off the data back to it. This is called hinted handoff, and it's used to
handle temporary failures.
To handle permanent failures, we can use a technique called anti-entropy protocol. This technique involves comparing data on each server and keeping them in sync with each other. This way even if one of our servers goes down, then we don't need to worry about its data because have them on other some other server. The challenge here is that, often, servers contain a lot of data and comparing them can be very costly. To make it more efficient we can use a structure called Merkle Tree. This data structure is a tree which contains hashes in its nodes. Each hash represents the content of its children (which are also hashes representing content of its children and so on). This allows to quickly iterate the tree to find piece of data which differs from what we expected. Thanks to that we can synchronize the servers using only data which differs between them which is more efficient.