URL Shortening Service like bit.ly

In this entry we will design a URL shortening service. Let's begin with a list of features.


Features:

The URL shortening service will take a URL and return a much shorter URL that redirects to the original. Users can also pick a custom short URL. We also need to decide what should happen when two users try to shorten the same URL.

Assume 100 million new URLs are added to the system each month, the average lifetime of a shortened URL is 2 weeks, and, per the 80/20 rule, 20% of URLs generate 80% of the traffic. Under these assumptions we may receive around 1 billion queries a month, i.e. ~400 queries per second: roughly 360 read requests and 40 write requests. Over 5 years we will need to shorten 100 million * 12 * 5 = 6 billion URLs.

We will use the characters (a-z, A-Z, 0-9) to encode our URLs. Let x be the minimum number of characters needed to represent 6 billion URLs; then x is the smallest integer such that 62^x > 6 * 10^9 (62 being the total number of characters in a-z, A-Z, 0-9), i.e. x = ceil(log base 62 of 6 * 10^9) = 6. So x = 6.
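The calculation above can be checked with a few lines of Python (a minimal sketch; the function name is ours, not part of any library):

```python
ALPHABET_SIZE = 62  # a-z, A-Z, 0-9

def min_code_length(total_urls: int) -> int:
    """Smallest x such that 62^x >= total_urls."""
    x = 1
    while ALPHABET_SIZE ** x < total_urls:
        x += 1
    return x

print(min_code_length(6 * 10**9))  # 6 characters cover 6 billion URLs
```

Note that 62^5 is only about 916 million, which falls short of 6 billion, while 62^6 is about 56.8 billion, which is plenty.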

The data flow for a read or write request contains a shortened URL and the original URL. The shortened URL takes 6 bytes, whereas we assume the original URL takes 500 bytes on average.

Based on the above data, write data per second = 40 * (500 + 6) bytes ≈ 20 KB/s
                       read data per second  = 360 * (500 + 6) bytes ≈ 182 KB/s
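As a quick sanity check, the bandwidth numbers fall out of the assumed request rates and record sizes:

```python
URL_SIZE = 500        # assumed average original URL size, bytes
HASH_SIZE = 6         # shortened key, bytes
WRITES_PER_SEC = 40
READS_PER_SEC = 360

write_bw = WRITES_PER_SEC * (URL_SIZE + HASH_SIZE)  # bytes per second
read_bw = READS_PER_SEC * (URL_SIZE + HASH_SIZE)    # bytes per second
print(write_bw, read_bw)  # 20240 182160
```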



A URL shortening service is latency sensitive: high latency on the URL shortener is as good as failing to resolve the request. The shortening service should have high consistency and availability.

Design Strategy:

We will define two APIs :

shortURL(url) - Store the URL mapping and return hash(url).
redirectURL(hash) - Redirect to URL_MAPPING[hash]

Basically, we are trying to build a system that serves a huge HashMap.
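The two APIs can be sketched as below. This is a toy in-memory version (the real system would back URL_MAPPING with a distributed store, and redirectURL would issue an HTTP redirect rather than return a string; the hashing scheme here is purely illustrative):

```python
import hashlib

URL_MAPPING: dict[str, str] = {}  # stand-in for the real datastore

def shortURL(url: str) -> str:
    """Store the URL mapping and return a 6-character key."""
    key = hashlib.sha256(url.encode()).hexdigest()[:6]
    URL_MAPPING[key] = url
    return key

def redirectURL(key: str) -> str:
    """Look up the original URL; a web layer would 301/302 to it."""
    return URL_MAPPING[key]

h = shortURL("https://example.com/some/long/path")
print(redirectURL(h))  # https://example.com/some/long/path
```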

The high level design for the write operation is as follows:



And, similarly, the high level design for the read operation is as follows:


We can use a list of salts to handle the case where two separate URLs get shortened to the same short URL (a collision). Directly encoding a URL to base 62 would let a user check whether a URL has already been shortened, and reverse engineering could lead the user to the exact function used; this should be avoided. To avoid it, we introduce randomization: if two users shorten the same URL, it should result in two separate shortened URLs. Plain base-62 encoding would also be unsuitable for a production environment because it leaks information about the database. For example, patterns can be learned to compute the growth rate (new URLs added per day), and in the worst case an attacker can enumerate and copy the whole database.
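A minimal sketch of the salting idea, under the same toy in-memory assumptions as before (random salts rather than a fixed list, retrying on collision):

```python
import hashlib
import os

URL_MAPPING: dict[str, str] = {}  # stand-in for the real datastore

def shorten_with_salt(url: str) -> str:
    """Mix a random salt into the hash so identical URLs get distinct
    keys and the scheme cannot be reverse-engineered; retry on collision."""
    while True:
        salt = os.urandom(8)
        key = hashlib.sha256(url.encode() + salt).hexdigest()[:6]
        if key not in URL_MAPPING:  # collision check against existing keys
            URL_MAPPING[key] = url
            return key

a = shorten_with_salt("https://example.com")
b = shorten_with_salt("https://example.com")
# a and b both resolve to the same URL but are (almost surely) different keys
```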

To add high availability, the simplest thing we can do is run multiple stateless app servers. They do not store any data (stateless), and all of them behave exactly the same way when up. If one of them goes down, the other app servers keep the site running.

The load balancer is a black box that stores metadata about the app servers and delegates each client request to the least overloaded active server.

Our URL shortening service is a simple key-value lookup (given the hash, return the corresponding URL). If the data is small enough to fit in a single machine's main memory, SQL with indexing is the clear winner, with next to zero maintenance: if your index fits in RAM, it's best to go with a standard SQL solution. In our case,

Write requests per month = 100 million
Average size of a URL = 500 bytes
System provisioned for ≈ 5 years
Implies space required = 500 bytes * 100M * 12 * 5 = 3 TB
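The same storage estimate as a two-line check:

```python
URL_SIZE = 500                    # bytes per stored URL (assumed average)
NEW_URLS_PER_MONTH = 100_000_000
MONTHS = 12 * 5                   # provision for 5 years

storage_bytes = URL_SIZE * NEW_URLS_PER_MONTH * MONTHS
print(storage_bytes / 10**12)  # 3.0 TB
```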

3 TB of data can fit on a single machine's hard disk, but the index might not fit in its RAM. Storing all the data on a single machine would make our read/write operations very slow, because of page swaps for the indices: when we access a SQL index frequently, that index gets cached in RAM as pages. On a single server with limited RAM, a large working set forces frequent page swaps (disk traversals), and performance degrades. Given that read latency is critical for a shortening service, we cannot store the data on a single machine, so a SQL solution would carry sharding overhead. Most NoSQL solutions, however, are built with the assumption that data does not fit on a single machine, and hence support sharding built in.

Since we are building a consistent shortening service with high availability, we will choose a NoSQL database like HBase.

We will discuss the database schema and other details of the shortening service in a subsequent entry. See you soon.

