M14 · High Performance Cloud Computing

Getting started with Elastic MapReduce

Overview:

Hadoop is a popular framework for batch processing with the map/reduce pattern. Amazon Elastic MapReduce is a service which helps take away the heavy lifting associated with running map/reduce clusters at scale.

Read more, or view more exercises →

← Back  Next →

Introducing Hadoop

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. It offers a simple programming model and a mechanism for distributing computation across computational clusters. There is a reasonable amount of heavy lifting associated with building, provisioning, maintaining and running a Hadoop cluster at high scale. Elastic MapReduce helps manage Hadoop clusters with a simple web service, a friendly user interface and advanced tools for building complex job flows at scale.

Creating a result bucket in Amazon S3

Elastic MapReduce uses S3 to store input data, map and reduce code and results data. In this exercise we'll run a job flow to calculate the word count from a data set stored in S3. The input data and code is already in S3, but if you would like to run your own, Java source code is also available to help get you started.

We'll need a new S3 bucket to store the output data. Create a new bucket in S3.

More details are available in the storage exercise →

Creating a new Elastic MapReduce cluster

In this exercise we'll run an Elastic MapReduce job flow to calculate the word count from a large literature corpus stored on S3.

Console
We'll get started with the Elastic MapReduce web management console. Log in, and jump to the Elastic MapReduce tab.

Click on 'Create New Job Flow' to start a new distributed map/reduce job.

Console
Give your job flow a name, select 'Run a sample application' and choose 'Word Count (Streaming)' from the drop down menu.
Console
Provide the necessary job flow parameters.

Be sure to include your results S3 bucket name in the output location.

Click 'Continue'

Console
We can now specify the amount of computational infrastructure Elastic MapReduce will provision for our Hadoop cluster.

The Master instance is the cluster controller group. We'll only need a single instance in this case.

The Core instance group contain the collection of instances that run the mapping and reducing functions.

Console
Enter a keypair name if you wish to log in to the running Elastic MapReduce instances, and provide any other advanced options.

If you set 'Keep alive' to 'yes', you'll need to terminate the cluster yourself. Turn it to 'no' and Elastic MapReduce will terminate the cluster automatically when the input files have been processed.

Click 'Continue', review your settings and click 'Create Job Flow' to start the Hadoop cluster and commence processing.

At run-time

Console
You can click on a new job flow in the Elastic MapReduce dashboard, and monitor the instance count by clicking on the 'Instance groups' tab.

If you set 'Keep alive' to 'yes', your job flow status will change to 'waiting' when it has completed processing the input files. It will change to 'terminated' if keep alive was set to 'no'.

Once your job flow has completed, you can view the results in Amazon S3.

Viewing results

Console
Switch to the Amazon S3 tab in the console, and view the contents of your S3 results bucket.

You should see a newly created results folder.

Dive into the folder and you will find your results organised by job flow and date. You can download the data, or pass it on as an input to another processing step.

Console
Download one of the results files to view the corpus words, ranked alphabetically, along with their occurrence.

Exercise complete!

After introducing Hadoop, this exercise walked through using Elastic MapReduce to provision a distributed system for large data processing.

← Back  Next →

Jump to: Getting started · Slide decks · Exercises · Feedback & Help