
Tuesday, 25 October 2016

URL Shortening service like bit.ly - 2

This is the continuation of the URL shortening design. In this entry we will explore different aspects of our proposed design.

In the proposed design, we are optimizing for consistency over availability, so we agreed to choose a NoSQL database that favors consistency, such as HBase (Cassandra can also be tuned toward stronger consistency). HBase maintains a meta table that records which key range lives on which machine, and it distributes keys across machines, called region servers, based on load. Cassandra inherently uses consistent hashing for sharding. Since the total data in our case is a few TBs, we don't need to think about sharding across data centers. NoSQL databases handle single-machine failure out of the box. A NoSQL database like HBase would have an availability gap of a few seconds for the affected keys when a region server fails, as it tries to maintain strong consistency.
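
As a concrete illustration, here is a minimal sketch of the read/write path against HBase using the happybase Python client. The Thrift server location, table name 'urls', and column family 'cf' are assumptions for illustration, not part of the design.

    import happybase

    # Assumed setup: an HBase Thrift server on localhost and a table
    # 'urls' with one column family 'cf' (both names are assumptions).
    conn = happybase.Connection('localhost')
    table = conn.table('urls')

    def store_mapping(short_key: bytes, original_url: bytes) -> None:
        # Row key is the 6-character short key; value is the original URL.
        table.put(short_key, {b'cf:url': original_url})

    def resolve(short_key: bytes):
        # Returns the original URL, or None if the key is unknown.
        return table.row(short_key).get(b'cf:url')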

We can also use an LRU caching strategy to improve performance, since most URLs are accessed only for a short period after creation. Also, if a shortened URL goes viral, serving it from the cache reduces the chance of overloading the database. We can go with Redis or Memcached for caching.
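
A minimal cache-aside sketch with Redis, reusing the resolve helper from the HBase sketch above. The one-day TTL is an illustrative choice; the LRU eviction itself comes from configuring Redis with maxmemory-policy allkeys-lru.

    import redis

    cache = redis.Redis(host='localhost', port=6379)
    CACHE_TTL_SECONDS = 24 * 3600  # assumption: keep entries for a day

    def cached_resolve(short_key: bytes):
        url = cache.get(short_key)
        if url is not None:
            return url               # cache hit: no database round trip
        url = resolve(short_key)     # cache miss: fall back to the database
        if url is not None:
            # setex writes the value with a TTL; with allkeys-lru
            # configured, Redis also evicts the least recently used
            # keys under memory pressure.
            cache.setex(short_key, CACHE_TTL_SECONDS, url)
        return url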

To shard the data across database machines, we can use the integer encoding of the shortened URL to distribute data among our shards. Say we assign values 0 to 61 to the characters a-z, A-Z, and 0-9; we can then compute the integer encoding of any shortened URL. We will use consistent hashing to ensure we do not have to rehash all the data if we add a new shard in the future.
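
Below is a minimal sketch of that scheme: the integer encoding plus a consistent-hash ring. The three shard names and the 100 virtual nodes per shard are illustrative assumptions.

    import bisect
    import hashlib

    ALPHABET = ('abcdefghijklmnopqrstuvwxyz'
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                '0123456789')
    CHAR_VALUE = {c: i for i, c in enumerate(ALPHABET)}

    def base62_value(short_key):
        # Integer encoding: a-z -> 0..25, A-Z -> 26..51, 0-9 -> 52..61.
        n = 0
        for c in short_key:
            n = n * 62 + CHAR_VALUE[c]
        return n

    class ConsistentHashRing:
        # Adding a shard only remaps the keys that fall between the new
        # shard's points and their predecessors; all other keys stay put.
        def __init__(self, shards, vnodes=100):
            self.ring = []  # sorted list of (point, shard)
            for shard in shards:
                for v in range(vnodes):
                    bisect.insort(self.ring, (self._hash(f'{shard}#{v}'), shard))

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def shard_for(self, short_key):
            point = self._hash(str(base62_value(short_key)))
            i = bisect.bisect(self.ring, (point, ''))
            return self.ring[i % len(self.ring)][1]

    ring = ConsistentHashRing(['shard-0', 'shard-1', 'shard-2'])
    print(ring.shard_for('aK9bZ3'))  # one of the three shards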

To add fault tolerance, we can use a master-slave scheme for each shard: one machine, the master, processes all writes, and the slave machines subscribe to those writes and keep updating themselves. If the master goes down, one of the slaves can take over and start serving write queries. The slaves also serve read requests.
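
A toy sketch of that routing, with in-memory dicts standing in for real database replicas; the replication and promotion logic is purely illustrative, not a production protocol.

    class ShardGroup:
        # One master plus slaves per shard: the master takes every write
        # and streams it to the slaves; reads can be served by any copy.
        def __init__(self, slave_count=2):
            self.stores = [dict() for _ in range(slave_count + 1)]
            self.master = 0  # index of the current master

        def write(self, short_key, url):
            self.stores[self.master][short_key] = url
            for i, store in enumerate(self.stores):
                if i != self.master:
                    store[short_key] = url  # slaves replay the write

        def read(self, short_key):
            # Serve reads from a slave to offload the master.
            slave = (self.master + 1) % len(self.stores)
            return self.stores[slave].get(short_key)

        def fail_over(self):
            # If the master dies, promote a slave; its copy is current.
            self.master = (self.master + 1) % len(self.stores)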

Saturday, 8 October 2016

URL Shortening service like bit.ly

In this entry we will design a URL shortening service. Let's begin with a list of features.


Features:

The URL shortening service takes a URL and returns a much smaller URL. This short URL redirects to the original URL. A user can also pick a custom short URL. We also need to decide what happens when two users try to shorten the same URL.

Assume 100 million new URLs are added to the system each month. The average lifetime of a shortened URL is 2 weeks, and per the 80/20 rule, 20% of URLs generate 80% of the traffic. Under these assumptions we may receive around 1 billion queries a month, i.e., ~400 queries per second: roughly 360 read requests and 40 write requests. In 5 years we will need to shorten 100 million * 12 * 5 = 6 billion URLs.
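
The back-of-envelope arithmetic, spelled out in a few lines (the 30-day month is an assumption):

    SECONDS_PER_MONTH = 30 * 24 * 3600          # assumption: 30-day month

    writes_per_month = 100_000_000              # new URLs per month
    queries_per_month = 1_000_000_000           # ~10 reads per write

    print(queries_per_month / SECONDS_PER_MONTH)  # ~386, call it ~400 qps
    print(writes_per_month / SECONDS_PER_MONTH)   # ~39, call it ~40 writes/s

    print(writes_per_month * 12 * 5)              # 6,000,000,000 in 5 years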

We will use the characters (a-z, A-Z, 0-9) to encode our URLs, an alphabet of 62 characters. Let x be the minimum number of characters needed to represent 6 billion URLs; then x is the smallest integer such that 62^x > 6*10^9, i.e., x = ceil(log(6*10^9) to base 62) = 6. So x = 6.
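
We can check the length and see the encoding in action with a small sketch; the counter-style numeric ID is an assumption about how keys might be generated.

    import math

    ALPHABET = ('abcdefghijklmnopqrstuvwxyz'
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                '0123456789')

    # Smallest x with 62**x > 6 * 10**9:
    print(math.ceil(math.log(6e9, 62)))  # 6

    def base62_encode(n, width=6):
        # Encode a numeric ID as a fixed-width base-62 short key.
        chars = []
        for _ in range(width):
            n, r = divmod(n, 62)
            chars.append(ALPHABET[r])
        return ''.join(reversed(chars))

    print(base62_encode(6_000_000_000 - 1))  # still only 6 characters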

The data flow for a read/write request contains the shortened URL and the original URL. A shortened URL takes 6 bytes, whereas an original URL takes about 500 bytes.

Based on the above data:
Write bandwidth per second = 40 * (500 + 6) bytes ≈ 20 KB/s
Read bandwidth per second = 360 * (500 + 6) bytes ≈ 182 KB/s



A URL shortening service is latency sensitive: high latency from the shortener is as bad as failing to resolve the request. The shortening service should offer high consistency and availability.

Design Strategy:

We will define two APIs:

shortURL(url) - Store the URL mapping and return hash(url).
redirectURL(hash) - Redirect to URL_MAPPING[hash].

Basically, we are trying to build a system that serves a huge HashMap.
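
A minimal in-memory sketch of these two APIs; here the key is random rather than a direct hash(url), anticipating the collision discussion below.

    import random
    import string

    ALPHABET = string.ascii_letters + string.digits  # 62 characters
    URL_MAPPING = {}  # the 'huge HashMap'

    def shortURL(url):
        # Store the URL mapping and return the 6-character key.
        key = ''.join(random.choices(ALPHABET, k=6))
        URL_MAPPING[key] = url
        return key

    def redirectURL(hash):
        # In a real web server this issues an HTTP 301/302 redirect.
        return URL_MAPPING[hash]

    key = shortURL('https://example.com/some/long/path')
    print(key, '->', redirectURL(key))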

The high-level design for the write operation is as follows:



And the high-level design for the read operation is similar:


We can use a list of salts to handle the case where two separate URLs get shortened to the same short URL (a collision). Directly encoding a URL to base 62 would let a user check whether a URL has already been shortened, and reverse engineering could reveal the exact function used; this should be avoided. To prevent it, we will introduce randomization: if two users shorten the same URL, they should get two separate shortened URLs. Encoding a sequential ID to base 62 is also unsuitable for a production environment because it leaks information about the database. For example, patterns could be used to compute the growth rate (new URLs added per day) and, in the worst case, to enumerate the whole database.
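
A sketch of salted hashing under these constraints; the salt length and the use of MD5 are illustrative choices, and a real system would check collisions against the database rather than a local map.

    import hashlib
    import secrets
    import string

    ALPHABET = string.ascii_letters + string.digits
    URL_MAPPING = {}

    def salted_short_key(url):
        # A fresh salt per request means the same URL shortens differently
        # for different users, and keys reveal nothing about growth rate.
        while True:
            salt = secrets.token_hex(8)
            digest = int(hashlib.md5((url + salt).encode()).hexdigest(), 16)
            key = ''
            for _ in range(6):
                digest, r = divmod(digest, 62)
                key += ALPHABET[r]
            if key not in URL_MAPPING:  # collision: retry with a new salt
                URL_MAPPING[key] = url
                return key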

To add high availability, the simplest thing we can do is run multiple stateless app servers. They store no data (stateless) and all behave exactly the same way when up. If one of them goes down, the remaining app servers keep the site running.

The load balancer is a black box that stores metadata about the app servers and delegates each client request to the least overloaded active server.

Our URL shortening service is a simple key-value lookup (given the hash, return the corresponding URL). If the data is small enough that its index fits in a single machine's RAM, a standard SQL solution with indexing is the clear winner, with next to zero maintenance. In our case:

Write requests per month = 100 million
Average size of a URL = 500 bytes
System provisioned for ≈ 5 years
Implies space required: 500 bytes * 100M * 12 * 5 = 3 TB

3 TB of data fits on a single machine's hard disk, but the index might not fit in RAM. Storing all the data on a single machine would make our read/write operations very slow: frequently accessed SQL index pages get cached in RAM, and with limited RAM a large index forces frequent page swaps (disk reads), degrading performance. Given that read latency is critical for a shortening service, we cannot keep everything on a single machine. A SQL solution would therefore carry sharding overhead, whereas most NoSQL solutions are built on the assumption that data does not fit on a single machine and support sharding out of the box.

Since we are building a consistent shortening service with high availability, we will choose a NoSQL database like HBase.

We will discuss the database schema and other details of the shortening service in a subsequent entry. See you soon.