Design a highly available and fault-tolerant system

Most of the higher-level services offered by a cloud provider such as Amazon (Amazon Simple Storage Service (S3), Amazon SimpleDB, Amazon Simple Queue Service (SQS), Amazon Elastic Load Balancing (ELB), etc.) have been built with fault tolerance and high availability in mind. Services that provide basic infrastructure, such as Amazon Elastic Compute Cloud (EC2) and Amazon Elastic Block Store (EBS), offer specific features, such as Availability Zones, Elastic IP addresses, and snapshots, that a fault-tolerant and highly available system must take advantage of and use correctly. Simply moving a system into the cloud does not make it fault-tolerant or highly available.

In this entry we will discuss the design of a fault-tolerant and highly available system.

Design Overview:
Load balancing is an effective way to increase the availability of a system. Instances that fail can be replaced seamlessly behind the load balancer while the other instances continue to operate. Elastic Load Balancing can be used to balance traffic across instances in multiple Availability Zones of a region.
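
To make the idea concrete, here is a minimal sketch of a round-robin balancer that skips instances marked as failed. It is a toy in-process model, not how Elastic Load Balancing is implemented; the class name, method names, and instance labels are all hypothetical.

```python
class RoundRobinBalancer:
    """Toy load balancer: rotates over instances and skips unhealthy ones."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._next = 0  # index of the next instance to try

    def mark_down(self, instance):
        """Record a failed health check; the instance stops receiving traffic."""
        self.healthy.discard(instance)

    def mark_up(self, instance):
        """Register a repaired or replacement instance as healthy."""
        if instance not in self.instances:
            self.instances.append(instance)
        self.healthy.add(instance)

    def route(self):
        """Return the next healthy instance, trying each instance at most once."""
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")
```

With instances "a", "b" and "c", calling mark_down("b") makes route() alternate between "a" and "c" until a replacement is registered with mark_up.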

Amazon EC2 is hosted in multiple locations worldwide. These locations are composed of regions and Availability Zones. Each region is a separate geographical area, and each region has multiple isolated locations known as Availability Zones. Amazon EC2 lets you place resources, such as instances and data, in multiple locations. Resources are not replicated across regions unless you do so explicitly.

Availability Zones (AZs) are distinct geographical locations that are engineered to be insulated from failures in other Availability Zones. By placing Amazon EC2 instances in multiple Availability Zones, an application can be protected from failure at a single location. It is important to run an independent application stack in more than one AZ, either in the same region or in another region. If the application fails in one zone, it continues running in another. When you design such a system, you need to understand its zone dependencies.
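
As a rough illustration of why spreading instances matters, the sketch below distributes a number of instances evenly over a list of zones and reports how much capacity survives if one zone fails. The function names and the zone names in the usage note are illustrative assumptions, not part of any AWS API.

```python
def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin and return a count per zone."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = {zone: 0 for zone in zones}
    for i in range(instance_count):
        placement[zones[i % len(zones)]] += 1
    return placement


def surviving_capacity(placement, failed_zone):
    """Count the instances still running when one zone is lost."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)
```

For example, five instances over the two hypothetical zones "us-east-1a" and "us-east-1b" gives three in the first and two in the second, so losing the first zone still leaves two instances serving traffic.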


Elastic IP addresses are public IP addresses that can be programmatically mapped between instances within a region. They are associated with the AWS account, not with a specific instance or the lifetime of an instance.

An Elastic IP address can be used to work around a host or Availability Zone failure by quickly remapping the address to another running instance or to a replacement instance that has just been started. Reserved Instances can help guarantee that such capacity is available in another zone.
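
The remapping step can be sketched as a small helper. This assumes the `ec2` argument is a boto3 EC2 client (for example `boto3.client("ec2")`) and that the address is identified by an allocation ID; the function name and the IDs in the usage note are hypothetical.

```python
def fail_over_elastic_ip(ec2, allocation_id, standby_instance_id):
    """Point an Elastic IP at a standby instance and return the association ID.

    `ec2` is assumed to be a boto3 EC2 client. AllowReassociation=True lets
    the call succeed even while the address is still attached to the
    failed instance.
    """
    response = ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        AllowReassociation=True,
    )
    return response["AssociationId"]
```

A monitoring job that detects the failure could call fail_over_elastic_ip(ec2, "eipalloc-...", "i-...") with the IDs of your own address and standby instance. Clients keep using the same public IP, so no DNS change is needed.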

Valuable data should not be stored only on instance storage without proper backups, replication, or the ability to re-create the data. Amazon Elastic Block Store (EBS) offers persistent off-instance storage volumes that are about an order of magnitude more durable than on-instance storage. EBS volumes are automatically replicated within a single Availability Zone. To increase durability further, point-in-time snapshots can be created to store the data on volumes in Amazon S3, where it is then replicated to multiple AZs. While an EBS volume is tied to a specific AZ, a snapshot is tied to a region. Using a snapshot, you can create a new EBS volume in any of the AZs of the same region. This is an effective way to deal with disk failures or other host-level issues, as well as with problems affecting an AZ. Snapshots are incremental, so it is better to hold on to recent snapshots.
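
The snapshot-and-restore flow described above might look like the following sketch. It assumes `ec2` is a boto3 EC2 client; the two function names are hypothetical, and error handling and waiting for the snapshot to complete are left out for brevity.

```python
def snapshot_volume(ec2, volume_id, description):
    """Start a point-in-time snapshot of an EBS volume and return its ID.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    response = ec2.create_snapshot(VolumeId=volume_id, Description=description)
    return response["SnapshotId"]


def restore_volume(ec2, snapshot_id, target_zone):
    """Create a new EBS volume from a snapshot in the chosen zone.

    The zone must belong to the same region as the snapshot.
    """
    response = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=target_zone)
    return response["VolumeId"]
```

Because the snapshot is regional, restore_volume can target a different zone from the one that held the original volume, which is what makes this useful when a whole AZ has problems.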

That is all about our highly available and fault-tolerant system. Keep reading!


