Lambda architecture for big data handling in AWS

Lambda architecture – overview

In our previous blog we discussed in detail the general architecture of Big Data and one of its most popular architectures, the Lambda architecture. In this blog we will see how to implement the Lambda architecture using AWS cloud infrastructure, which is currently the most popular cloud platform in the world.

AWS and Big Data

The cloud platform provided by AWS is now one of the biggest and most used platforms for Big Data solutions.

Many features make AWS well suited to Big Data solutions. Let us look at some of the general features and services in this section:

1. No hardware to manage

First things first! AWS provides the environment and services, so we do not need to worry about procuring hardware or anything else related to it.

This alone is a significant gain: removing hardware and its maintenance from the picture saves considerable development time and lets the team focus on other features.

2. Empowering developers

The frequent introduction of new technologies into the AWS ecosystem is another advantage worth considering.

This empowers developers to try out new ideas or improve on the existing solution with little additional cost or operational expense. A new solution can also be tested without affecting the existing system.

3. Powerful big data solutions incorporated

For Big Data analytics applications, AWS provides some of the most popular solutions, such as Hadoop, as managed services.

This means we can configure Hadoop cluster deployments to scale to the required capacity without manual intervention and in very little time.

This flexibility and scalability is one of the primary reasons firms switch to AWS-based solutions for Big Data.

4. Fast deployments

Another main attraction of AWS is that setting up and configuring a Big Data application takes very little time. AWS services, fully managed or otherwise, take only a fraction of the time to configure and set up compared with traditional means, which makes for really quick deployments.

5. Data Pipeline service

The Data Pipeline service in AWS is a data orchestration product that helps move and process data between different storage and compute services, offloading those workloads from our own systems.

It is widely used to track data, process it and move it to the intended locations or services. It also lets us handle the data flow differently for different scenarios by writing logical conditions.

6. Data warehousing made easy

AWS offers services designed specifically to address the challenges that most Big Data implementations commonly face.

Services like Kinesis provide a great solution for collecting real-time data of high frequency and volume, and AWS data warehousing services like Redshift can handle huge amounts of data.

7. Inbuilt support for ML

Machine learning is used extensively on stored data for predictive analysis. AWS has many Machine Learning services that enable us to perform simple to complex ML tasks with ease.

8. Others

Other notable services include AWS Lambda, Amazon Elasticsearch Service, QuickSight etc., which are immensely popular among Big Data enthusiasts in the AWS world.

Lambda architecture in AWS

Many open source tools are still pioneers of data handling in the Big Data world. Tools like Apache Hadoop, Apache Spark, Pig, Hive etc. have been around for a long time and have proved their mettle. What if we had the option of using them in conjunction with AWS? And what if AWS supported and enhanced some of these crucial tools in its cloud environment? This is indeed the case, and the combination gives us great flexibility to handle and tame the huge amounts of data we have.

Let us have a look at the Lambda architecture implementation of such a combination:

Lambda architecture

1. Distribution Layer

The data collected from the various data sources has to be ingested without any loss, and the ingestion must be able to handle huge volumes. AWS provides Kinesis Streams as the solution for this. Kinesis Streams can take in large volumes of data from multiple sources and make it available to multiple consumers, which deliver it to the destinations we specify. Kinesis retains the data for a limited period (24 hours by default, extendable to 7 days or longer), and within that period we can transfer the data. A general practice is to push the data to S3.
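To make the ingestion step concrete, here is a minimal sketch of how a producer might batch events for Kinesis. It assumes a client object exposing `put_records(StreamName=..., Records=...)` as the boto3 Kinesis client does; the stream name, the `device_id` field and the helper names are hypothetical and only serve the illustration.

```python
import json
from typing import Any, Dict, Iterable, List

# A single PutRecords request accepts at most 500 records.
MAX_RECORDS_PER_CALL = 500


def to_kinesis_entries(events: Iterable[Dict[str, Any]], key_field: str) -> List[Dict[str, Any]]:
    """Turn plain event dicts into PutRecords entries (Data + PartitionKey)."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Records with the same partition key land on the same shard.
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]


def chunk(entries: List[Dict[str, Any]], size: int = MAX_RECORDS_PER_CALL) -> List[List[Dict[str, Any]]]:
    """Split the entries into request-sized batches."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def ingest(client: Any, stream_name: str, events: Iterable[Dict[str, Any]],
           key_field: str = "device_id") -> int:
    """Send all events and return how many records were reported as failed."""
    failed = 0
    for batch in chunk(to_kinesis_entries(events, key_field)):
        response = client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

With boto3 installed, the client would typically be created with `boto3.client("kinesis")`. A real producer would also retry the records reported as failed, which this sketch leaves out.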

2. Batch Layer

The batch layer consists of AWS S3 buckets and the AWS EMR service. The S3 bucket holds the data coming from Kinesis. The processing is done by AWS EMR (Elastic MapReduce). Amazon EMR is an Apache Hadoop based service that supports processing very large pools of data in a distributed environment. EMR is used for log analysis, data warehousing, web indexing, financial data analytics, machine learning, simulation etc. EMR also supports workloads based on Presto, HBase and Apache Spark.
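The idea behind the batch layer is that a view is recomputed from the complete raw dataset each time. On EMR this would normally be a Hadoop or Spark job reading from S3; the plain Python sketch below only illustrates the recomputation itself. The event fields `page` and `timestamp` are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def compute_batch_view(master_dataset: Iterable[Dict[str, str]]) -> Dict[Tuple[str, str], int]:
    """Recompute page-view counts per (page, day) from all raw events."""
    counts: Counter = Counter()
    for event in master_dataset:
        # Assumes ISO 8601 timestamps, whose first 10 characters are YYYY-MM-DD.
        day = event["timestamp"][:10]
        counts[(event["page"], day)] += 1
    return dict(counts)
```

Because the view is always rebuilt from the raw events, a bug in the counting logic can be fixed simply by correcting the function and running the batch again.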

3. Speed Layer

The speed layer here consists of an Apache Storm cluster deployed on EC2 machines. Apache Storm is a popular tool for real-time data processing. It can be deployed to EC2 machines with minimal effort, and it gives us very low latency for real-time data processing.
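Unlike the batch layer, the speed layer updates its view incrementally, one event at a time, and only needs to cover the data the batch layer has not processed yet. The sketch below is not Storm code; it is a small Python stand-in for what a counting step in a stream processor does, with hypothetical `page` and `timestamp` fields.

```python
from collections import defaultdict
from typing import Dict, Tuple


class RealtimeView:
    """Incrementally maintained page-view counts per (page, day)."""

    def __init__(self) -> None:
        self._counts: Dict[Tuple[str, str], int] = defaultdict(int)

    def update(self, event: Dict[str, str]) -> None:
        """Apply a single incoming event to the view."""
        key = (event["page"], event["timestamp"][:10])
        self._counts[key] += 1

    def discard_through(self, day: str) -> None:
        """Drop counts for days (YYYY-MM-DD) the batch layer has already absorbed."""
        for key in [k for k in self._counts if k[1] <= day]:
            del self._counts[key]

    def snapshot(self) -> Dict[Tuple[str, str], int]:
        """Return a plain-dict copy of the current view."""
        return dict(self._counts)
```

Calling `discard_through` after each batch run keeps the real-time view small and avoids counting the same events in both layers.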

4. Serving Layer

The serving layer of this architecture consists of the data warehousing storage provided by AWS, namely Redshift. Redshift is well suited to merging and storing the real-time and batch views because it can handle data in the petabyte range and the data can be queried from it very quickly. Another option here instead of Redshift is Amazon Athena, which queries data stored in S3.
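Merging the two views can be pictured as follows. In Redshift this would usually be a SQL query over two tables; the Python sketch below shows the same logic on in-memory dictionaries keyed by a hypothetical (page, day) pair, and it assumes the real-time view only contains events that the batch view does not yet cover.

```python
from typing import Dict, Tuple

View = Dict[Tuple[str, str], int]


def merge_views(batch_view: View, realtime_view: View) -> View:
    """Combine batch and real-time counts into a single queryable view."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


def total_views(page: str, batch_view: View, realtime_view: View) -> int:
    """Answer a query for one page across all days."""
    merged = merge_views(batch_view, realtime_view)
    return sum(count for (p, _day), count in merged.items() if p == page)
```

If the two views overlapped in time, events would be counted twice, so the boundary between them has to be managed carefully.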

5. Visualisation Layer

The visualisation layer can be any open source tool or one that AWS provides. Since Apache Storm is involved, the real-time view can also be piped directly to other endpoints without going through Redshift.

Lambda Architecture in AWS – Serverless

In the previous section we saw an implementation of the Lambda architecture on AWS using open source tools such as Storm. While it is efficient, the question arises whether this architecture utilises the full potential of the AWS platform.

It does not. AWS can run this serverless, and it has more refined services to handle the operations performed by the likes of Storm. Let us get familiar with such an architecture in this section.

Lambda architecture

In pursuit of a complete AWS solution, we will replace some of the components of the previous model with fully managed AWS services.

1. Batch Layer

Here the EMR service is replaced by AWS Glue, a fully managed service created to process and transform data. Glue is extremely helpful for extract, transform and load (ETL) operations, as it allows most of this to be done within a few clicks.
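For readers new to ETL, the three steps can be sketched in a few lines of plain Python. This is not Glue code and does not use the Glue libraries; it only shows the shape of an extract, transform and load job, with a made-up CSV layout (`customer`, `amount`) as the assumed input.

```python
import csv
import io
import json
from typing import Dict, List


def extract(csv_text: str) -> List[Dict[str, str]]:
    """Read raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Normalise customer names and drop rows whose amount is not numeric."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed rows instead of failing the whole job
        cleaned.append({"customer": row["customer"].strip().lower(), "amount": amount})
    return cleaned


def load(rows: List[Dict[str, object]]) -> str:
    """Serialise the cleaned rows as JSON Lines, ready to be written to storage."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)
```

In a managed service the same three stages would read from and write to data stores such as S3 rather than in-memory strings.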

2. Speed Layer

As an alternative to Apache Storm for real-time stream processing, we can use two Kinesis services, namely Kinesis Analytics and Kinesis Firehose, which allow us to process and deliver the real-time stream data.
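A typical real-time computation on a stream is a windowed aggregation, for example counting events per key in fixed one-minute windows. Kinesis Analytics lets us express this kind of query over the stream; the Python sketch below only mimics a tumbling window on an in-memory list, and the `(epoch_seconds, key)` event shape is an assumption for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(events: Iterable[Tuple[int, str]],
                           window_seconds: int = 60) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, key) using fixed, non-overlapping windows."""
    counts: Counter = Counter()
    for epoch_seconds, key in events:
        # Every timestamp belongs to exactly one window, identified by its start.
        window_start = epoch_seconds - (epoch_seconds % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)
```

The resulting per-window counts are the kind of output that would then be delivered onward, for instance to S3 or Redshift.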

3. Visualisation Layer

The custom visualisation tools can be replaced with Amazon QuickSight, especially when used with Redshift.


In this blog we have seen different approaches to implementing the highly popular Lambda architecture for Big Data in the AWS cloud. We have seen how open source tools like Storm can be incorporated, and also their alternatives in the AWS world. In a future blog in this architecture series, we will see how the Lambda architecture is implemented using Apache Spark.
