Your Own Highly Available Database - 2

In the previous entry, Your Own Highly Available Database, we designed a simple highly available database. That design risked losing data if the master shard suffered a hardware failure. In this entry, we will add some fault tolerance to our old system.

Initially we started with a master and multiple slaves in each shard. That gives every shard a single point of failure: the master node.

What about a peer-to-peer system where every node is equally privileged and any two nodes can communicate with each other? It eliminates the single point of failure, and theoretically our system stays available even while DB machines are dying. Dynamo and Cassandra work on a similar principle.

How will we shard data in a peer-to-peer system? To add redundancy, we can choose P (the replication factor) clockwise-consecutive nodes, starting from the node to which data item D is mapped by our consistent hashing algorithm. A client can then ask any of these P nodes for the data it wants. This redundancy is essential in creating a highly available system.
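Here is a minimal sketch of this placement, assuming an MD5-based ring and made-up node names; `preference_list` walks clockwise from the key's position and collects the P distinct nodes that should hold replicas of D:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a key or node name onto the ring (0 .. 2^32 - 1).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.p = replication_factor
        # Sorted ring positions, each paired with the node that owns it.
        self._positions = sorted((_hash(n), n) for n in nodes)

    def preference_list(self, key: str):
        """Return the P clockwise-consecutive nodes responsible for key."""
        start = bisect.bisect_right(self._positions, (_hash(key), ""))
        picked = []
        for i in range(len(self._positions)):
            _, node = self._positions[(start + i) % len(self._positions)]
            if node not in picked:   # skip duplicates when virtual nodes are used
                picked.append(node)
            if len(picked) == self.p:
                break
        return picked

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"], replication_factor=3)
print(ring.preference_list("user:42"))   # e.g. ['node-c', 'node-d', 'node-a']
```

Giving each physical node several ring positions (virtual nodes) evens out the load, which is why the duplicate check above is worth keeping even in a sketch.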

Read-Write Strategy:

Now the data is distributed across shards, and each piece of data is replicated on P nodes in the ring. A client can send a put() request to any node in the ring to write data; that node acts as the coordinating node for the request. The coordinating node forwards the request to the nodes the key maps to and waits for acknowledgements. W is the minimum number of nodes from which the coordinating node must receive an acknowledgement before the write counts as successful; once it has W acknowledgements, it returns a write-accepted signal to the client.

For a get() request, the client again connects to any node in the ring, which becomes the coordinating node for the request. The coordinating node asks all replica nodes for the key and returns the consolidated data to the client once R of the replicas have answered. R is the minimum number of nodes from which the coordinating node must receive the value back before replying to the client. Both paths are sketched below.
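This is a minimal illustration of the coordinator's quorum bookkeeping, not Dynamo's or Cassandra's actual implementation: the `store`/`fetch` replica methods, the IOError failure mode, and the timestamped records are all assumptions made for the example.

```python
import time

class Coordinator:
    """Quorum read/write sketch; `replicas` stand in for network clients."""
    def __init__(self, replicas, w, r):
        self.replicas, self.w, self.r = replicas, w, r

    def put(self, key, value):
        # Forward the write to the replica nodes for the key; succeed after W acks.
        record = (time.time(), value)    # timestamp kept for later conflict resolution
        acks = 0
        for node in self.replicas:
            try:
                node.store(key, record)  # hypothetical replica API
                acks += 1
            except IOError:
                continue                 # dead node: move on to the next replica
            if acks >= self.w:
                return True   # write accepted (a real system keeps propagating in the background)
        return False          # fewer than W replicas reachable: write rejected

    def get(self, key):
        # Ask the replicas for the key; reply once R of them have answered.
        replies = []
        for node in self.replicas:
            try:
                replies.append(node.fetch(key))  # hypothetical replica API
            except IOError:
                continue
            if len(replies) >= self.r:
                # Consolidate: with timestamped records, the newest value wins.
                return max(replies, key=lambda rec: rec[0])[1]
        raise RuntimeError("read quorum not reached")
```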

The values of R and W are very important here. For a read to be consistent, i.e. to return the most recently written value, we need W + R > P: any set of R replicas must then overlap, in at least one node, with the W replicas that acknowledged the last write.

We can adjust W and R based on the functionality needed. For example, for very fast writes we can set W = 1 and R = P; similarly, for a read-heavy workload we can set R = 1 and W = P. For a balanced load we keep both R and W at (P + 1)/2 (e.g. W = R = 2 when P = 3).

In the current design no single node is solely responsible for a piece of data, so one node going down does not stop reads or writes. As long as W of the P nodes for a key are available, the data can be updated; if fewer than W are up, we can relax the consistency requirement in favour of availability. The same holds for read requests.

If we keep W = P, we get strong consistency, but we lose write availability the moment a single node goes down. Therefore we will build our system with W < P. A write then propagates gradually: some nodes will have the updated view of the data and some will not (inconsistency). The best we can guarantee is eventual consistency: in due time all changes are applied to all nodes, and in the end all of them hold a consistent view of the data.

How can we achieve eventual consistency?

To achieve eventual consistency, we need to resolve differences between the copies of the data held on two nodes. How conflicts are detected and resolved depends on the shape of the data:

1. If the data (key-value) is such that the value is a single column, we can use the simple last-write-wins rule to resolve conflicts: when two nodes have inconsistent views of the data, the resolve step overwrites the stale copy with the newer one, and the data becomes consistent (see the first sketch after this list).

2. If the value is composed of more than one independent column, we instead store augmented data with each row, recording every coordinating node that has handled the row so far. To detect and resolve a conflict we compare the augmented data of the two copies: if one is a subset of the other, we can safely discard the copy with the smaller augmented data; otherwise the row is genuinely in conflict and we need application-level logic to resolve it (see the second sketch below).
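For case 1, the resolver itself is tiny. A sketch assuming each replica stores the value together with its write timestamp:

```python
def resolve_lww(a, b):
    """Last write wins: records are (timestamp, value) pairs; the newest survives."""
    return a if a[0] >= b[0] else b

# The stale copy on the losing node is overwritten with the winner.
print(resolve_lww((1700000100.0, "blue"), (1700000200.0, "green")))  # (1700000200.0, 'green')
```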
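For case 2, the augmented data is essentially a version vector. A sketch of the subset test described above, assuming the augmented data is a map from coordinating node to an update counter:

```python
def compare_versions(va, vb):
    """Compare two version vectors {coordinating_node: counter}.
    Returns which copy to keep, or 'conflict' for concurrent updates."""
    a_covers_b = all(va.get(n, 0) >= c for n, c in vb.items())
    b_covers_a = all(vb.get(n, 0) >= c for n, c in va.items())
    if a_covers_b:
        return 'a'        # b's history is contained in a's: drop b
    if b_covers_a:
        return 'b'        # a's history is contained in b's: drop a
    return 'conflict'     # neither contains the other: application must merge

print(compare_versions({'node-a': 2, 'node-b': 1}, {'node-a': 1}))  # 'a'
print(compare_versions({'node-a': 2}, {'node-b': 1}))               # 'conflict'
```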
