Big Data Analytics

Using Amazon Web Services

Hadoop's power comes from its ability to perform work on a large number of machines simultaneously. What if you want to experiment with Hadoop, but do not have many machines? While operations on a two or four-node cluster are functionally equivalent to those on a 40 or 100-node cluster, processing larger volumes of data will require a larger number of nodes.

Amazon provides machines for rent on demand through their Elastic Compute Cloud (a.k.a. EC2) service. EC2 is part of a broader set of services collectively called the Amazon Web Services, or AWS. EC2 allows you request a set of nodes ("instances" in their parlance) for as long as you need them. You pay by the instance*hour, plus costs for bandwidth. You can use EC2 instances to run a Hadoop cluster. Hadoop comes with a set of scripts which will provision EC2 instances.

The first step in this process is visit the EC2 web site (link above) and click "Sign Up For This Web Service". You will need to create an account and provide billing information. Then follow the instructions in the Getting started guide to set up your account and configure your system to run the AWS tools.

Once you have done so, follow the instructions in the Hadoop wiki specific to running Hadoop on Amazon EC2. While more details are available in the above document, the shortest steps to provisioning a cluster are:

* Edit src/contrib/ec2/bin/hadoop-ec2-env.sh to contain your Amazon account information and parameters about the desired cluster size.
* Execute src/contrib/ec2/bin/hadoop-ec2 launch-cluster.

After the cluster has been started, you can log in to the head node over ssh with the bin/hadoop-ec2 login script, and perform your MapReduce computation. When you are done, log out and type bin/hadoop-ec2 terminate-cluster to release the EC2 instances. The contents of the virtual hard drives on the instances will disappear, so be sure to copy off any important data with scp or another tool first!

A very thorough introduction to configuring Hadoop on EC2 and running a test job is provided in this article in the Amazon Web Services Developer Connection site.