M14 · High Performance Cloud Computing
Getting started with Elastic MapReduce
Getting started with Elastic MapReduce
Hadoop is a popular framework for batch processing with the map/reduce pattern. Amazon Elastic MapReduce is a service which helps take away the heavy lifting associated with running map/reduce clusters at scale.
Read more, or view more exercises →
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. It offers a simple programming model and a mechanism for distributing computation across computational clusters. There is a reasonable amount of heavy lifting associated with building, provisioning, maintaining and running a Hadoop cluster at high scale. Elastic MapReduce helps manage Hadoop clusters with a simple web service, a friendly user interface and advanced tools for building complex job flows at scale.
Elastic MapReduce uses S3 to store input data, map and reduce code and results data. In this exercise we'll run a job flow to calculate the word count from a data set stored in S3. The input data and code is already in S3, but if you would like to run your own, Java source code is also available to help get you started.
We'll need a new S3 bucket to store the output data. Create a new bucket in S3.
More details are available in the storage exercise →
In this exercise we'll run an Elastic MapReduce job flow to calculate the word count from a large literature corpus stored on S3.
Click on 'Create New Job Flow' to start a new distributed map/reduce job.
Be sure to include your results S3 bucket name in the output location.
Click 'Continue'
The Master instance is the cluster controller group. We'll only need a single instance in this case.
The Core instance group contain the collection of instances that run the mapping and reducing functions.
If you set 'Keep alive' to 'yes', you'll need to terminate the cluster yourself. Turn it to 'no' and Elastic MapReduce will terminate the cluster automatically when the input files have been processed.
Click 'Continue', review your settings and click 'Create Job Flow' to start the Hadoop cluster and commence processing.
If you set 'Keep alive' to 'yes', your job flow status will change to 'waiting' when it has completed processing the input files. It will change to 'terminated' if keep alive was set to 'no'.
Once your job flow has completed, you can view the results in Amazon S3.
You should see a newly created results folder.
Dive into the folder and you will find your results organised by job flow and date. You can download the data, or pass it on as an input to another processing step.
After introducing Hadoop, this exercise walked through using Elastic MapReduce to provision a distributed system for large data processing.