Message sync protocol for messaging service

Using conventional back-end systems for a messaging service like WhatsApp causes lag in performance and excess data usage, especially on networks with costly data plans and limited bandwidth. To fix this, we need to completely reimagine how data is synchronized to the device and change the way data is processed in the back end.

In this entry we will discuss a new sync protocol for the messaging service that decreases non-media data usage by 40%. By reducing congestion on the network, we expect to see approximately a 20% decrease in the number of people who experience failures when trying to send messages.

Initially we started with a pull-based protocol for getting data down to the client. When the client receives a message, it first receives a lightweight push notification indicating that a new message is available. This triggers the app to send the server a complicated HTTPS request, in response to which it receives a very large JSON payload with the updated conversation view.

Instead of the above model, we can move to a push-based snapshot-and-delta model. In this model, the client retrieves an initial snapshot of its messages using an HTTPS pull request and then subscribes to delta updates, which are immediately pushed to the client through MQTT (a low-power, low-bandwidth protocol) as messages are received. As a result, the client can quickly display an up-to-date view without ever making the HTTPS request. We can also replace the JSON-based encoding for messages and delta updates. JSON is great if we need a flexible, human-readable format for transferring data without a lot of developer overhead, but it is not the most compact on the wire, so we can replace JSON with Apache Thrift. Switching from JSON to Thrift allows us to reduce our payload size by roughly 50%.
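To make the snapshot-and-delta idea concrete, here is a minimal sketch of the client side in Python. The names (`Conversation`, `apply_delta`) and the delta shapes are illustrative assumptions, not the real API; the point is that after the one-time snapshot pull, every update is applied locally from a pushed delta with no further HTTPS round trips.

```python
# Hypothetical in-memory client illustrating the snapshot + delta model.
# Class and field names are illustrative, not the real service's API.
class Conversation:
    def __init__(self, snapshot):
        # One-time HTTPS pull: the full conversation snapshot.
        self.messages = list(snapshot)

    def apply_delta(self, delta):
        # Deltas arrive as MQTT pushes; no HTTPS round trip is needed.
        if delta["type"] == "new_message":
            self.messages.append(delta["message"])
        elif delta["type"] == "mark_read":
            for m in self.messages:
                m["read"] = True

snapshot = [{"id": 1, "text": "hi", "read": True}]
conv = Conversation(snapshot)
conv.apply_delta({"type": "new_message",
                  "message": {"id": 2, "text": "hello", "read": False}})
conv.apply_delta({"type": "mark_read"})
```

After the two deltas, the client holds the fully up-to-date view without ever re-fetching the conversation.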

On the server side, messaging data has traditionally been stored on spinning disks. In the pull-based model, we would write to disk before sending a trigger to the client to read from disk. This giant storage tier would serve real-time message data as well as the full conversation history. But one large storage tier does not scale well for synchronizing recent messages to the app in real time. To support synchronization at scale, we need a faster sync protocol that maintains consistency between the app client and long-term storage. To do this, we need to be able to stream the same sequence of updates in real time to the app client and to the storage tier in parallel, on a per-user basis.

Our new sync protocol will be a totally ordered queue of messaging updates (a new message, a state change for messages read, etc.) with separate pointers into the queue indicating the last update sent to your app client and to the traditional storage tier. When a message is successfully sent to disk or to your phone, the corresponding pointer is advanced. When your phone (client) is offline, or there is a disk outage, that pointer stays in place while new messages are still enqueued and the other pointers advance. As a result, long disk-write latencies do not hinder the client's real-time communication, and we can keep the client and the traditional storage tier in sync at independent rates.
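The pointer mechanics can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the production implementation; the names (`SyncQueue`, `pending`, `advance`) are made up for this example. The key property it shows is that each consumer's pointer advances independently, so a slow disk or an offline client never blocks enqueueing.

```python
# Illustrative sketch of the totally ordered update queue with independent
# per-consumer pointers; names are assumptions, not the real service.
class SyncQueue:
    def __init__(self, start_seq):
        self.updates = []          # (seq_id, update), in total order
        self.next_seq = start_seq
        # Each consumer's pointer holds the last sequence id it received.
        self.pointers = {"disk": start_seq - 1, "client": start_seq - 1}

    def enqueue(self, update):
        seq = self.next_seq
        self.updates.append((seq, update))
        self.next_seq += 1
        return seq

    def pending(self, consumer):
        # Updates not yet acknowledged by this consumer.
        last = self.pointers[consumer]
        return [(s, u) for s, u in self.updates if s > last]

    def advance(self, consumer, seq):
        # Advance the pointer only after a successful send.
        self.pointers[consumer] = seq

q = SyncQueue(100)
q.enqueue("message A")   # assigned seq 100
q.enqueue("message B")   # assigned seq 101
q.advance("disk", 101)   # disk is up to date
# The client is offline: its pointer stays put while updates keep enqueueing.
```

Here the disk pointer has caught up while the offline client still has both updates pending, exactly the independent-rates behavior described above.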

The following is a common sequence of operations using our new message sync protocol.


1. Our ordered queue contains five updates and has two pointers:

  • The disk pointer indicates that traditional disk storage is up to date, having received the update with sequence id 104.
  • The app/client pointer indicates that our app is offline and the last update it received had sequence id 101.


2. Shagun sends me a new message, which is enqueued at the head of my queue and assigned sequence id 105.


3. The message sync protocol then sends the new message to traditional disk storage for long-term persistence, and the disk pointer is advanced.



4. Some time later my phone comes online; the client/app pings the queue to activate the app pointer.



5. The message sync protocol sends all missing updates to the client/app, and the app pointer is advanced to indicate that our app is up to date.
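The five steps above can be replayed as a short, self-contained sketch using the sequence ids from the example (disk at 104, offline client at 101). Variable names here are illustrative assumptions.

```python
# Step 1: five updates in the queue; disk is current, the client is behind.
queue = [(sid, f"update {sid}") for sid in range(100, 105)]  # ids 100..104
pointers = {"disk": 104, "client": 101}

# Step 2: Shagun's message is enqueued at the head with sequence id 105.
queue.append((105, "message from Shagun"))

# Step 3: the message is persisted to disk and the disk pointer advances.
pointers["disk"] = 105

# Steps 4-5: the phone comes online, receives every update after its
# pointer (102..105), and the app pointer advances to the head.
missing = [(sid, u) for sid, u in queue if sid > pointers["client"]]
pointers["client"] = queue[-1][0]
```

At the end, both pointers sit at 105 and the client has received exactly the four updates it missed while offline.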



Effectively, this queue-based model allows:
  • The most recent messages to be sent immediately to online apps and to the disk storage tier from the protocol's memory.
  • A week's worth of messages to be served by the queue's backing store in case of a disk outage or an app being offline for a while.
  • Older conversation history and full inbox snapshot fetches to be served from the traditional disk storage tier.
