This post is a little more formal than usual as I wrote this for a tutorial on how to run hadoop in the clouds, but I thought this was very useful so I am posting it here for everyone's benefit (hopefully).
When CloudStack graduated from the Apache Incubator in March 2013 it joined Hadoop as a Top-Level Project (TLP) within the Apache Software Foundation (ASF). This made the ASF the only Open Source Foundation which contains a cloud platform and a big data solution. Moreover a closer look at the projects making the entire ASF shows that approximately 30% of the Apache Incubator and 10% of the TLPs is "Big Data" related. Projects such as Hbase, Hive, Pig and Mahout are sub-projects of the Hadoop TLP. Ambari, Kafka, Falcon and Mesos are part of the incubator and all based on Hadoop.
To Complement CloudStack, API wrappers such as Libcloud, deltacloud and jclouds are also part of the ASF. To connect CloudStack and Hadoop two interesting projects are also in the ASF: Apache Whirr a TLP, and Provisionr currently in incubation. Both Whirr and Provisionr aimed at providing an abstraction layer to define big data infrastructure based on Hadoop and instantiate those infrastructure on Clouds, including Apache CloudStack based clouds. This co-existence of CloudStack and the entire Hadoop ecosystem under the same Open Source Foundation means that the same governance, processes and development principles apply to both project bringing great synergy that promises an even better complementarity.
In this tutorial we introduce Apache Whirr, an application that can be used to define, provision and configure big data solutions on CloudStack based clouds. Whirr automatically starts instances in the cloud and boostrapps hadoop on them. It can also add packages such as Hive, Hbase and Yarn for map-reduce jobs.
Whirr [1] is a "set of libraries for running cloud services" and specifically big data services. Whirr is based on jclouds [2]. Jclouds is a java based abstraction layer that provides a common interface to a large set of Cloud Services and providers such as Amazon EC2, Rackspace servers and CloudStack. As such all Cloud providers supported in Jclouds are supported in Whirr. The core contributors of Whirr include four developers from Cloudera the well-known Hadoop distribution. Whirr can also be used as a command line tool, making it straightforward for users to define and provision Hadoop clusters in the Cloud.
As an Apache project, Whirr comes as a source tarball and can be downloaded from one of the Apache mirrors [3]. Similarly to CloudStack, Whirr community members can host packages. Cloudera is hosting whirr packages to ease the installation. For instance on Ubuntu and Debian based systems you can add the Cloudera repository by creating /etc/apt/sources.list.d/cloudera.list and putting the following contents in it:
deb [arch=amd64] http://archive.cloudera.com/cdh4/-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/-cdh4 contrib
With this repository in place, one can install whirr with:
$sudo apt-get install whirr
The whirr command will now be available.
Developers can use the latest version of Whirr by cloning the software repository, writing new code and submitting patches the same way that they would submit patches to CloudStack. To clone the git repository of Whirr do:
$git clone git://git.apache.org/whirr.git
They can then build their own version of whirr using maven:
$mvn install
The whirr binary will be located under the /bin directory. Adding it to one's path with:
$export PATH=$PATH:/path/to/whirr/bin
Will make the whirr command available in the user's environment. Successfull installation can be checked by simply entering:
$whirr --help
With whirr installed, one now needs to specify the credentials of the Cloud that will be used to create the Hadoop infrastructure. A ~/.whirr/credentials has been created during the installation phase. The type of provider (e.g cloudstack), the endpoint of the cloud and the access and secret keys need to be entered in this credentials file like so:
PROVIDER=cloudstack
IDENTITY=
CREDENTIAL=
ENDPOINT=
For instance on Exoscale [4] a CloudStack based cloud in Switzerland, the endpoint would be https://api.exoscale.ch/compute
Now that the CloudStack cloud endpoint and keys have been configured, the hadoop cluster that we want to instantiate needs to be defined. This is done in a properties file using a set of Whirr specific configuration variables [5]. Below is the content of the file with explanations in-line:
---------------------------------------
# Set the name of your hadoop cluster
whirr.cluster-name=hadoop
# Change the name of cluster admin user
whirr.cluster-user=${sys:user.name}
# Change the number of machines in the cluster here
# Below we define one hadoop namenode and 3 hadoop datanode
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
# Specify which distribution of hadoop you want to use
# Here we choose to use the Cloudera distribution
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
# Use a specific instance type.
# Specify the uuid of the CloudStack service offering to use for the instances of your hadoop cluster
whirr.hardware-id=b6cd1ff5-3a2f-4e9d-a4d1-8988c1191fe8
# If you use ssh key pairs to access instances in the cloud
# Specify them like so
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_exoscale
whirr.public-key-file=${whirr.private-key-file}.pub
# Specify the template to use for the instances
# This is the uuid of the CloudStack template
whirr.image-id=1d16c78d-268f-47d0-be0c-b80d31e765d2
------------------------------------------------------
To launch this Hadoop cluster use the whirr command line:
$whirr launch-cluster --config hadoop.properties
The following example output shows the instances being started and boostrapped. At the end of the provisioning, whirr returns the ssh command that shall be used to access the hadoop instances.
-------------------
Running on provider cloudstack using identity mnH5EbKcKeJd456456345634563456345654634563456345
Bootstrapping cluster
Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker
Configuring template for bootstrap-hadoop-namenode_hadoop-jobtracker
Starting 3 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(9d5c46f8-003d-4368-aabf-9402af7f8321)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(6727950e-ea43-488d-8d5a-6f3ef3018b0f)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-namenode_hadoop-jobtracker} on node(6a643851-2034-4e82-b735-2de3f125c437)
<< success executing InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725): {output=This function does nothing. It just needs to exist so Statements.call("retry_helpers") doesn't call something which doesn't exist
Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B]
Get:2 http://security.ubuntu.com precise-security Release [49.6 kB]
Hit http://ch.archive.ubuntu.com precise Release.gpg
Get:3 http://ch.archive.ubuntu.com precise-updates Release.gpg [198 B]
Get:4 http://ch.archive.ubuntu.com precise-backports Release.gpg [198 B]
Hit http://ch.archive.ubuntu.com precise Release
..../snip/.....
You can log into instances using the following ssh commands:
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.xx.yy.zz
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.zz.zz.rr
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.tt.yy.uu
[hadoop-namenode+hadoop-jobtracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.ii.oo.pp
-----------
To destroy the cluster from your client do:
$whirr destroy-cluster --config hadoop.properties.
Whirr gives you the ssh command to connect to the instances of your hadoop cluster, login to the namenode and browse the hadoop file system that was created:
$ hadoop fs -ls /
Found 5 items
drwxrwxrwx - hdfs supergroup 0 2013-06-21 20:11 /hadoop
drwxrwxrwx - hdfs supergroup 0 2013-06-21 20:10 /hbase
drwxrwxrwx - hdfs supergroup 0 2013-06-21 20:10 /mnt
drwxrwxrwx - hdfs supergroup 0 2013-06-21 20:11 /tmp
drwxrwxrwx - hdfs supergroup 0 2013-06-21 20:11 /user
Create a directory to put your input data.
$ hadoop fs -mkdir input
$ hadoop fs -ls /user/sebastiengoasguen
Found 1 items
drwxr-xr-x - sebastiengoasguen supergroup 0 2013-06-21 20:15 /user/sebastiengoasguen/input
Create a test input file and put in the hadoop file system:
$ cat foobar
this is a test to count the words
$ hadoop fs -put ./foobar input
$ hadoop fs -ls /user/sebastiengoasguen/input
Found 1 items
-rw-r--r-- 3 sebastiengoasguen supergroup 34 2013-06-21 20:17 /user/sebastiengoasguen/input/foobar
Define the map-reduce environment. Note that this default Cloudera distribution installation uses MRv1. To use Yarn one would have to edit the hadoop.properties file.
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
Start the map-reduce job:
$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar wordcount input output
13/06/21 20:19:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/21 20:20:00 INFO input.FileInputFormat: Total input paths to process : 1
13/06/21 20:20:00 INFO mapred.JobClient: Running job: job_201306212011_0001
13/06/21 20:20:01 INFO mapred.JobClient: map 0% reduce 0%
13/06/21 20:20:11 INFO mapred.JobClient: map 100% reduce 0%
13/06/21 20:20:17 INFO mapred.JobClient: map 100% reduce 33%
13/06/21 20:20:18 INFO mapred.JobClient: map 100% reduce 100%
13/06/21 20:20:21 INFO mapred.JobClient: Job complete: job_201306212011_0001
13/06/21 20:20:22 INFO mapred.JobClient: Counters: 32
13/06/21 20:20:22 INFO mapred.JobClient: File System Counters
13/06/21 20:20:22 INFO mapred.JobClient: FILE: Number of bytes read=133
13/06/21 20:20:22 INFO mapred.JobClient: FILE: Number of bytes written=766347
13/06/21 20:20:22 INFO mapred.JobClient: FILE: Number of read operations=0
13/06/21 20:20:22 INFO mapred.JobClient: FILE: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient: FILE: Number of write operations=0
13/06/21 20:20:22 INFO mapred.JobClient: HDFS: Number of bytes read=157
13/06/21 20:20:22 INFO mapred.JobClient: HDFS: Number of bytes written=50
13/06/21 20:20:22 INFO mapred.JobClient: HDFS: Number of read operations=2
13/06/21 20:20:22 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient: HDFS: Number of write operations=3
13/06/21 20:20:22 INFO mapred.JobClient: Job Counters
13/06/21 20:20:22 INFO mapred.JobClient: Launched map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient: Launched reduce tasks=3
13/06/21 20:20:22 INFO mapred.JobClient: Data-local map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=10956
13/06/21 20:20:22 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=15446
13/06/21 20:20:22 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient: Map-Reduce Framework
13/06/21 20:20:22 INFO mapred.JobClient: Map input records=1
13/06/21 20:20:22 INFO mapred.JobClient: Map output records=8
13/06/21 20:20:22 INFO mapred.JobClient: Map output bytes=66
13/06/21 20:20:22 INFO mapred.JobClient: Input split bytes=123
13/06/21 20:20:22 INFO mapred.JobClient: Combine input records=8
13/06/21 20:20:22 INFO mapred.JobClient: Combine output records=8
13/06/21 20:20:22 INFO mapred.JobClient: Reduce input groups=8
13/06/21 20:20:22 INFO mapred.JobClient: Reduce shuffle bytes=109
13/06/21 20:20:22 INFO mapred.JobClient: Reduce input records=8
13/06/21 20:20:22 INFO mapred.JobClient: Reduce output records=8
13/06/21 20:20:22 INFO mapred.JobClient: Spilled Records=16
13/06/21 20:20:22 INFO mapred.JobClient: CPU time spent (ms)=1880
13/06/21 20:20:22 INFO mapred.JobClient: Physical memory (bytes) snapshot=469413888
13/06/21 20:20:22 INFO mapred.JobClient: Virtual memory (bytes) snapshot=5744541696
13/06/21 20:20:22 INFO mapred.JobClient: Total committed heap usage (bytes)=207687680
And you can finally check the output:
$ hadoop fs -cat output/part-* | head
this 1
to 1
the 1
a 1
count 1
is 1
test 1
words 1
Of course this is a silly example of map-reduce job and you will want to do much more critical tasks. In order to benchmark your cluster Hadoop comes with examples jar.
To benchmark your hadoop cluster you can use the TeraSort tools available in the hadoop distribution. Generate some 100 MB of input data with TeraGen (100 byte rows):
$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teragen 1000000 output3
Sort it with TeraSort:
$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar terasort output3 output4
And then validate the results with TeraValidate:
$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teravalidate output4 outvalidate
Performance of map-reduce jobs run in Cloud based hadoop clusters will be highly dependent on the hadoop configuration, the template and the service offering being used and of course on the underlying hardware of the Cloud. Hadoop was not designed to run in the Cloud and therefore some assumptions were made that do not fit the Cloud model, see [6] for more information. Deploying Hadoop in the Cloud however is a viable solution for on-demand map-reduce applications. Development work is currently under way within the Google Summer of Code program to provide CloudStack with a compatible Amazon Elastic Map-Reduce (EMR) service. This service will be based on Whirr or a new Amazon CloudFormation compatible interface called StackMate [7].
[1] http://whirr.apache.org
[2] http://jclouds.incubator.apache.org
[3] http://www.apache.org/dyn/closer.cgi/whirr/
[4] http://exoscale.ch
[5] http://whirr.apache.org/docs/0.8.2/configuration-guide.html
[6] http://wiki.apache.org/hadoop/Virtual%20Hadoop
[7] https://github.com/chiradeep/stackmate