Design a Batch Processing System

There are numerous batch-oriented applications in place today that can leverage on-demand processing, including claims processing, large-scale data transformation, media transcoding, and multi-part data processing work.

Batch processing architectures are often synonymous with highly variable usage patterns that see significant usage peaks (e.g., month-end processing) followed by significant periods of under-utilization.

There are numerous approaches to building a batch processing architecture. In this entry we will discuss a basic batch processing architecture that supports job scheduling, job status inspection, uploading raw data, outputting job results, and reporting job performance data.


Users interact with the Job Manager application, which is deployed on an Amazon Elastic Compute Cloud (EC2) instance. This component controls the process of accepting, scheduling, starting, managing, and completing batch jobs. It also provides access to final results, job and worker statistics, and progress information.
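As a loose sketch of the state such a Job Manager might track, one could model a job record and its lifecycle along the following lines; the class and state names are illustrative assumptions, not part of the reference architecture.

```python
from dataclasses import dataclass, field
from enum import Enum

class JobState(Enum):
    """Lifecycle states the Job Manager moves a batch job through."""
    ACCEPTED = "accepted"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"

@dataclass
class BatchJob:
    """Minimal job record the Job Manager might keep per batch job."""
    job_id: str
    input_keys: list = field(default_factory=list)  # S3 keys of raw inputs
    state: JobState = JobState.ACCEPTED
```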

Raw job data is uploaded to Amazon Simple Storage Service (S3), a highly available and persistent data store.
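A minimal sketch of this upload step using boto3 (the AWS SDK for Python); the bucket name and key layout are assumptions for illustration, not prescribed by the architecture.

```python
import boto3

# Hypothetical bucket name for raw job inputs.
RAW_DATA_BUCKET = "example-batch-raw-data"

s3 = boto3.client("s3")

def upload_raw_job_data(job_id: str, local_path: str) -> str:
    """Upload a raw input file to S3 and return its object key."""
    key = f"jobs/{job_id}/input/{local_path.rsplit('/', 1)[-1]}"
    s3.upload_file(local_path, RAW_DATA_BUCKET, key)
    return key
```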

Individual job tasks are inserted by the Job Manager into an Amazon Simple Queue Service (SQS) input queue on the user's behalf.
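A hedged sketch of how the Job Manager might enqueue tasks with boto3; the queue URL and the message layout are assumptions made for this example.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URL; in practice this would be configured or looked up.
INPUT_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/batch-input"

def enqueue_job_tasks(job_id: str, input_keys: list) -> None:
    """Split a job into one task per input object and enqueue each task."""
    for task_no, key in enumerate(input_keys):
        message = {"job_id": job_id, "task": task_no, "s3_key": key}
        sqs.send_message(
            QueueUrl=INPUT_QUEUE_URL,
            MessageBody=json.dumps(message),
        )
```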

Worker nodes are Amazon EC2 instances deployed in an Auto Scaling group. This group is a container that ensures the health and scalability of worker nodes. Worker nodes pick up job parts from the input queue automatically and perform single tasks that are part of the list of batch processing steps (a worker loop is sketched after the next step).

Interim results from worker nodes are stored in Amazon S3.
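A minimal sketch of a worker node's loop covering these two steps, reusing the hypothetical queue URL and introducing an equally hypothetical results bucket: it long-polls the input queue, processes one task, writes the interim result to S3, and only then deletes the message so a crashed worker's task is redelivered by SQS.

```python
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

# Hypothetical names, matching the earlier sketches.
INPUT_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/batch-input"
RESULTS_BUCKET = "example-batch-results"

def process_task(task: dict) -> dict:
    """Placeholder for the application-specific batch processing step."""
    return {"job_id": task["job_id"], "task": task["task"], "status": "done"}

def worker_loop() -> None:
    """Poll the input queue, process one task at a time, persist the result."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=INPUT_QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling to reduce empty receives
        )
        for msg in resp.get("Messages", []):
            task = json.loads(msg["Body"])
            result = process_task(task)
            # Store the interim result in S3 under a per-job prefix.
            s3.put_object(
                Bucket=RESULTS_BUCKET,
                Key=f"jobs/{task['job_id']}/results/{task['task']}.json",
                Body=json.dumps(result).encode("utf-8"),
            )
            # Delete only after the result is safely persisted.
            sqs.delete_message(
                QueueUrl=INPUT_QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```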

Progress information and statistics are stored in the analytics store. This component can be either an Amazon SimpleDB domain or a relational database such as an Amazon Relational Database Service (RDS) instance.
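If SimpleDB is chosen, a worker might record per-task progress roughly as follows, via boto3's "sdb" client; the domain name and attribute schema are assumptions for illustration.

```python
import boto3

sdb = boto3.client("sdb")

# Hypothetical SimpleDB domain for job statistics.
STATS_DOMAIN = "batch-job-stats"

def record_task_progress(job_id: str, task: int, status: str) -> None:
    """Record per-task progress as attributes on a SimpleDB item."""
    sdb.put_attributes(
        DomainName=STATS_DOMAIN,
        ItemName=f"{job_id}/{task}",
        Attributes=[
            {"Name": "status", "Value": status, "Replace": True},
        ],
    )
```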

Optionally, completed tasks can be inserted into an Amazon SQS queue for chaining to a second processing stage.
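Such chaining could look like the sketch below, with a hypothetical second-stage queue URL; the message simply points the next stage at the stored result.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URL for the next processing stage.
STAGE_TWO_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/batch-stage2"

def chain_to_next_stage(job_id: str, task: int, result_key: str) -> None:
    """Hand a completed task off to the second-stage queue."""
    sqs.send_message(
        QueueUrl=STAGE_TWO_QUEUE_URL,
        MessageBody=json.dumps(
            {"job_id": job_id, "task": task, "result_key": result_key}
        ),
    )
```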
