Thursday, December 19, 2013

Clojure with CloudStack

CloStack

CloStack is a Clojure client for Apache CloudStack. Clojure is a dynamic programming language for the Java Virtual Machine (JVM). It is compiled directly to JVM bytecode but offers the dynamic and interactive nature of an interpreted language like Python. Clojure is a dialect of LISP and as such is mostly a functional programming language.

You can try Clojure in your browser and get familiar with its read-eval-print loop (REPL). To get started, you can follow the tutorial for non-LISP programmers through this web based REPL.

To give you a taste for it, here is how you would add 2 and 2:

user=> (+ 2 2)
4

And how you would define a function:

user=> (defn f [x y]
  #_=> (+ x y))
#'user/f
user=> (f 2 3)
5
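
Functions are first-class values in Clojure, so you can pass f straight to higher-order functions like map (a quick illustration):

user=> (map f [1 2 3] [10 20 30])
(11 22 33)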

This should give you a taste of functional programming :)

Install Leiningen

Leiningen is a tool for managing Clojure projects easily. With lein you can create the skeleton of a Clojure project as well as start a read-eval-print loop (REPL) to test your code.

Installing the latest version of leiningen is easy: get the script, put it in your path and make it executable, and you are done.
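
On a Unix-like system this boils down to something like the following sketch (assuming ~/bin is on your PATH and using the stable script location documented in the leiningen README):

mkdir -p ~/bin
curl -o ~/bin/lein https://raw.github.com/technomancy/leiningen/stable/bin/lein
chmod +x ~/bin/lein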

The first time you run lein repl it will bootstrap itself:

$ lein repl
Downloading Leiningen to /Users/sebgoa/.lein/self-installs/leiningen-2.3.4-standalone.jar now...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.0M  100 13.0M    0     0  1574k      0  0:00:08  0:00:08 --:--:-- 2266k
nREPL server started on port 58633 on host 127.0.0.1
REPL-y 0.3.0
Clojure 1.5.1
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
 Results: Stored in vars *1, *2, *3, an exception in *e

user=> exit
Bye for now!

Download CloStack

To install CloStack, clone the github repository and start lein repl:

$ git clone https://github.com/pyr/clostack.git
$ cd clostack
$ lein repl
Retrieving codox/codox/0.6.4/codox-0.6.4.pom from clojars
Retrieving codox/codox.leiningen/0.6.4/codox.leiningen-0.6.4.pom from clojars
Retrieving leinjacker/leinjacker/0.4.1/leinjacker-0.4.1.pom from clojars
Retrieving org/clojure/core.contracts/0.0.1/core.contracts-0.0.1.pom from central
Retrieving org/clojure/core.unify/0.5.3/core.unify-0.5.3.pom from central
Retrieving org/clojure/clojure/1.4.0/clojure-1.4.0.pom from central
Retrieving org/clojure/core.contracts/0.0.1/core.contracts-0.0.1.jar from central
Retrieving org/clojure/core.unify/0.5.3/core.unify-0.5.3.jar from central
Retrieving org/clojure/clojure/1.4.0/clojure-1.4.0.jar from central
Retrieving codox/codox/0.6.4/codox-0.6.4.jar from clojars
Retrieving codox/codox.leiningen/0.6.4/codox.leiningen-0.6.4.jar from clojars
Retrieving leinjacker/leinjacker/0.4.1/leinjacker-0.4.1.jar from clojars
Retrieving org/clojure/clojure/1.3.0/clojure-1.3.0.pom from central
Retrieving org/clojure/data.json/0.2.2/data.json-0.2.2.pom from central
Retrieving http/async/client/http.async.client/0.5.2/http.async.client-0.5.2.pom from clojars
Retrieving com/ning/async-http-client/1.7.10/async-http-client-1.7.10.pom from central
Retrieving io/netty/netty/3.4.4.Final/netty-3.4.4.Final.pom from central
Retrieving org/clojure/data.json/0.2.2/data.json-0.2.2.jar from central
Retrieving com/ning/async-http-client/1.7.10/async-http-client-1.7.10.jar from central
Retrieving io/netty/netty/3.4.4.Final/netty-3.4.4.Final.jar from central
Retrieving http/async/client/http.async.client/0.5.2/http.async.client-0.5.2.jar from clojars
nREPL server started on port 58655 on host 127.0.0.1
REPL-y 0.3.0
Clojure 1.5.1
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
 Results: Stored in vars *1, *2, *3, an exception in *e

user=> exit

The first time you start the REPL, lein will download all the dependencies of clostack.

Prepare environment variables and make your first clostack call

Export a few environment variables to define the cloud you will be using, namely:

export CLOUDSTACK_ENDPOINT=http://localhost:8080/client/api
export CLOUDSTACK_API_KEY=HGWEFHWERH8978yg98ysdfghsdfgsagf
export CLOUDSTACK_API_SECRET=fhdsfhdf869guh3guwghseruig

Then relaunch the REPL:

$ lein repl
nREPL server started on port 59890 on host 127.0.0.1
REPL-y 0.3.0
Clojure 1.5.1
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
 Results: Stored in vars *1, *2, *3, an exception in *e

user=> (use 'clostack.client)
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
nil

You can safely ignore the warning message; it only indicates that clostack is meant to be used as a library in a Clojure project.

Define a client to your CloudStack endpoint:

user=> (def cs (http-client))
#'user/cs

And call an API like so:

user=> (list-zones cs)
{:listzonesresponse {:count 1, :zone [{:id "1128bd56-b4d9-4ac6-a7b9-c715b187ce11", :name "CH-GV2", :networktype "Basic", :securitygroupsenabled true, :allocationstate "Enabled", :zonetoken "ccb0a60c-79c8-3230-ab8b-8bdbe8c45bb7", :dhcpprovider "VirtualRouter", :localstorageenabled true}]}}

To explore the API calls that you can make, the REPL features tab completion. Enter list or de and press the tab key; you should see:

user=> (list
list                                list*                               list-accounts                       list-async-jobs                     
list-capabilities                   list-disk-offerings                 list-event-types                    list-events                         
list-firewall-rules                 list-hypervisors                    list-instance-groups                list-ip-forwarding-rules            
list-iso-permissions                list-isos                           list-lb-stickiness-policies         list-load-balancer-rule-instances   
list-load-balancer-rules            list-network-ac-ls                  list-network-offerings              list-networks                       
list-os-categories                  list-os-types                       list-port-forwarding-rules          list-private-gateways               
list-project-accounts               list-project-invitations            list-projects                       list-public-ip-addresses            
list-remote-access-vpns             list-resource-limits                list-security-groups                list-service-offerings              
list-snapshot-policies              list-snapshots                      list-ssh-key-pairs                  list-static-routes                  
list-tags                           list-template-permissions           list-templates                      list-virtual-machines               
list-volumes                        list-vp-cs                          list-vpc-offerings                  list-vpn-connections                
list-vpn-customer-gateways          list-vpn-gateways                   list-vpn-users                      list-zones                          
list?

user=> (de
dec                           dec'                          decimal?                      declare                       def                           
default-data-readers          definline                     definterface                  defmacro                      defmethod                     
defmulti                      defn                          defn-                         defonce                       defprotocol                   
defrecord                     defreq                        defstruct                     deftype                       delay                         
delay?                        delete-account-from-project   delete-firewall-rule          delete-instance-group         delete-ip-forwarding-rule     
delete-iso                    delete-lb-stickiness-policy   delete-load-balancer-rule     delete-network                delete-network-acl            
delete-port-forwarding-rule   delete-project                delete-project-invitation     delete-remote-access-vpn      delete-security-group         
delete-snapshot               delete-snapshot-policies      delete-ssh-key-pair           delete-static-route           delete-tags                   
delete-template               delete-volume                 delete-vpc                    delete-vpn-connection         delete-vpn-customer-gateway   
delete-vpn-gateway            deliver                       denominator                   deploy-virtual-machine        deref                         
derive                        descendants                   destroy-virtual-machine       destructure                   detach-iso                    
detach-volume

To pass arguments to a call, follow this syntax:

user=> (list-templates cs :templatefilter "executable")

Start a virtual machine

To deploy a virtual machine you need the serviceofferingid (or instance type), the templateid (also known as the image id) and the zoneid. The call is then very similar to CloudMonkey and returns a jobid:

user=> (deploy-virtual-machine cs :serviceofferingid "71004023-bb72-4a97-b1e9-bc66dfce9470" :templateid "1d961c82-7c8c-4b84-b61b-601876dab8d0" :zoneid "1128bd56-b4d9-4ac6-a7b9-c715b187ce11")
{:deployvirtualmachineresponse {:id "d0a887d2-e20b-4b25-98b3-c2995e4e428a", :jobid "21d20b5c-ea6e-4881-b0b2-0c2f9f1fb6be"}}

You can pass additional parameters to the deploy-virtual-machine call, such as the keypair and the securitygroupname:

user=> (deploy-virtual-machine cs :serviceofferingid "71004023-bb72-4a97-b1e9-bc66dfce9470" :templateid "1d961c82-7c8c-4b84-b61b-601876dab8d0" :zoneid "1128bd56-b4d9-4ac6-a7b9-c715b187ce11" :keypair "exoscale")
{:deployvirtualmachineresponse {:id "b5fdc41f-e151-43e7-a036-4d87b8536408", :jobid "418026fc-1009-4e7a-9721-7c9ad47b49e4"}}

To query the asynchronous job, you can use the query-async-job-result API call:

user=> (query-async-job-result cs :jobid "418026fc-1009-4e7a-9721-7c9ad47b49e4")
{:queryasyncjobresultresponse {:jobid "418026fc-1009-4e7a-9721-7c9ad47b49e4", :jobprocstatus 0, :jobinstancetype "VirtualMachine", :accountid "b8c0baab-18a1-44c0-ab67-e24049212925", :jobinstanceid "b5fdc41f-e151-43e7-a036-4d87b8536408", :created "2013-12-16T12:25:21+0100", :jobstatus 0, :jobresultcode 0, :cmd "com.cloud.api.commands.DeployVMCmd", :userid "968f6b4e-b382-4802-afea-dd731d4cf9b9"}}

And finally to destroy the virtual machine you would pass the id of the VM to the destroy-virtual-machine call like so:

user=> (destroy-virtual-machine cs :id "d0a887d2-e20b-4b25-98b3-c2995e4e428a")
{:destroyvirtualmachineresponse {:jobid "8fc8a8cf-9b54-435c-945d-e3ea2f183935"}}

With these simple basics you can keep on exploring clostack and the CloudStack API.
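
To tie these calls together, here is a rough sketch (untested) of deploying a VM and polling the async job until it finishes; it assumes, as in the CloudStack API, that a jobstatus of 0 means the job is still pending:

(defn wait-for-job
  "Polls query-async-job-result every 5 seconds until the job leaves the pending state."
  [cs jobid]
  (loop []
    (let [res    (query-async-job-result cs :jobid jobid)
          status (get-in res [:queryasyncjobresultresponse :jobstatus])]
      (if (zero? status)
        (do (Thread/sleep 5000) (recur))
        res))))

(let [resp  (deploy-virtual-machine cs
              :serviceofferingid "71004023-bb72-4a97-b1e9-bc66dfce9470"
              :templateid "1d961c82-7c8c-4b84-b61b-601876dab8d0"
              :zoneid "1128bd56-b4d9-4ac6-a7b9-c715b187ce11")
      jobid (get-in resp [:deployvirtualmachineresponse :jobid])]
  (wait-for-job cs jobid))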

Use CloStack within your own clojure project

Hello World in clojure

To write your own Clojure project that makes use of clostack, use leiningen to create a project skeleton:

lein new toto

Lein will automatically create a src/toto/core.clj file. Edit it to replace the function foo with -main; this dummy function prints Hello, World!. Let's try to execute it. First we need to define the main namespace in the project.clj file. Edit it like so:

(defproject toto "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :main toto.core
  :dependencies [[org.clojure/clojure "1.5.1"]])

Note the :main toto.core

You can now execute the code with lein run john. Indeed, if you check the -main function in src/toto/core.clj you will see that it takes an argument. You should see the following output:

$ lein run john
john Hello, World!
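
For reference, after renaming foo to -main, src/toto/core.clj looks roughly like this (a sketch; the docstring is whatever lein generated):

(ns toto.core)

(defn -main
  "I don't do a whole lot."
  [x]
  (println x "Hello, World!"))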

Let's now add the CloStack dependency and modify the main function to return the zone of the CloudStack cloud.

Adding the Clostack dependency

Edit the project.clj to add a dependency on clostack and a few logging packages:

:dependencies [[org.clojure/clojure "1.5.1"]
               [clostack "0.1.3"]
               [org.clojure/tools.logging "0.2.6"]
               [org.slf4j/slf4j-log4j12   "1.6.4"]
               [log4j/apache-log4j-extras "1.0"]
               [log4j/log4j               "1.2.16"
                :exclusions [javax.mail/mail
                             javax.jms/jms
                             com.sun.jdkmk/jmxtools
                             com.sun.jmx/jmxri]]])
                             

lein should have created a resources directory. In it, create a log4j.properties file like so:

$ more log4j.properties 
# Root logger option
log4j.rootLogger=INFO, stdout

# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

A discussion of logging is beyond the scope of this recipe; we merely add it to the configuration for a complete example.

Now you can edit the code in src/toto/core.clj with some basic calls.

(ns toto.core
  (:require [clostack.client :refer [http-client list-zones]]))

(defn foo
  "I don't do a whole lot."
  [x]
  (println x "Hello, World!"))

(def cs (http-client))

(defn -main [args]
  (println (list-zones cs))
  (println args "Hey Wassup")
  (foo args)
)

Simply run this Clojure code with lein run joe in the source of your project. And that's it, you have successfully discovered the very basics of Clojure and used the CloudStack client clostack to write your first Clojure code. Now for something more significant, look at Pallet.

Wednesday, December 18, 2013

2014 Cloud Predictions

Warning: this is written with a glass of wine in one hand, two days before vacation ... :)

1. CloudStack will abandon semantic versioning and adopt super hero names for its releases, this will make upgrade paths more understandable.

2. Someone will take the Euca API server and stick the CloudStack backend beneath it; adding OpenNebula packaging will make this the best cloud distro of all.

3. I will finally make sense of the NetflixOSS plethora of software and reach nirvana by integrating CloudStack in Asgard.

4. AWS will open source its software, killing OpenStack, and we will realize that in fact they use CloudStack with Euca in front.

5. I will understand what NFV, VNF and SDN really mean and come up with a new acronym that will set twitter on fire.

6. We will actually see some code in Solum.

7. bitcoin will crash and come back up at least five times.

8. Citrix stock will jump 100% on acquisition by IBM.

9. My boss will stop asking me for statistics.

10. Facebook will die on a Snowden revelation.

I will stop at 10 otherwise this could go on all night :)

Happy Holidays everyone

Friday, December 06, 2013

Veewee, Vagrant and CloudStack

Coming back from the CloudStack conference, the feeling that this is not about building clouds got stronger. This is really about what to do with them and how they bring you agility, faster time to market and allow you to focus on innovation in your core business. A large component of this is culture and a change in how we do IT. The DevOps movement is the embodiment of this change. Over in Amsterdam I was stoked to meet with folks that I had seen at other locations throughout Europe in the last 18 months, folks from PaddyPower, SchubergPhilis, Inuits who all embrace DevOps. I also met new folks, including Hugo Correia from Klarna (CloudStack users) who came by to talk about the vagrant-cloudstack plugin. His talk and a demo by Roland Kuipers from Schuberg were enough to kick my butt and get me to finally check out Vagrant. I sprinkled a bit of Veewee and of course some CloudStack on top of it all. Have fun reading.

Automation is key to a reproducible, failure-tolerant infrastructure. Cloud administrators should aim to automate all steps of building their infrastructure and be able to re-provision everything with a single click. This is possible through a combination of configuration management, monitoring and provisioning tools. To get started creating appliances that will be automatically configured and provisioned, two tools stand out in the arsenal: Veewee and Vagrant.

Veewee: Veewee is a tool to easily create appliances for different hypervisors. It fetches the .iso of the distribution you want and builds the machine with a kickstart file. It integrates with providers like VirtualBox so that you can build these appliances on your local machine. It supports most commonly used OS templates. Coupled with VirtualBox it allows admins and devs to create reproducible base appliances. Getting started with veewee is a 10 minute exercise. The README is great and there is also a very nice post that guides you through your first box building.

Most folks will have no issues cloning Veewee from github and building it; you will need ruby 1.9.2 or above. You can get it via `rvm` or your favorite ruby version manager.

git clone https://github.com/jedi4ever/veewee
cd veewee
gem install bundler
bundle install

Setting up an alias is handy at this point: `alias veewee="bundle exec veewee"`. You will need a virtual machine provider (e.g. VirtualBox, VMware Fusion, Parallels, KVM). I personally use VirtualBox, but pick one and install it if you don't have it already. You will then be able to start using `veewee` on your local machine. Check the sub-commands available (for VirtualBox):

$ veewee vbox
Commands:
  veewee vbox build [BOX_NAME]                     # Build box
  veewee vbox copy [BOX_NAME] [SRC] [DST]          # Copy a file to the VM
  veewee vbox define [BOX_NAME] [TEMPLATE]         # Define a new basebox starting from a template
  veewee vbox destroy [BOX_NAME]                   # Destroys the virtualmachine that was built
  veewee vbox export [BOX_NAME]                    # Exports the basebox to the vagrant format
  veewee vbox halt [BOX_NAME]                      # Activates a shutdown the virtualmachine
  veewee vbox help [COMMAND]                       # Describe subcommands or one specific subcommand
  veewee vbox list                                 # Lists all defined boxes
  veewee vbox ostypes                              # List the available Operating System types
  veewee vbox screenshot [BOX_NAME] [PNGFILENAME]  # Takes a screenshot of the box
  veewee vbox sendkeys [BOX_NAME] [SEQUENCE]       # Sends the key sequence (comma separated) to the box. E.g for testing the :boot_cmd_sequence
  veewee vbox ssh [BOX_NAME] [COMMAND]             # SSH to box
  veewee vbox templates                            # List the currently available templates
  veewee vbox undefine [BOX_NAME]                  # Removes the definition of a basebox 
  veewee vbox up [BOX_NAME]                        # Starts a Box
  veewee vbox validate [BOX_NAME]                  # Validates a box against vagrant compliancy rules
  veewee vbox winrm [BOX_NAME] [COMMAND]           # Execute command via winrm

Options:
          [--debug]           # enable debugging
  -w, --workdir, [--cwd=CWD]  # Change the working directory. (The folder containing the definitions folder).
                              # Default: /Users/sebgoa/Documents/gitforks/veewee

Pick a template from the `templates` directory and `define` your first box:

veewee vbox define myfirstbox CentOS-6.5-x86_64-minimal

You should see that a `definitions/` directory has been created; browse to it and inspect the `definition.rb` file. You might want to comment out some lines, like removing `chef` or `puppet`. If you don't change anything and build the box, you will then be able to `validate` the box with `veewee vbox validate myfirstbox`. To build the box simply do:

veewee vbox build myfirstbox

Everything should be successful, and you should see a running VM in your VirtualBox UI. To export it for use with `Vagrant`, `veewee` provides an export mechanism (really a VBoxManage command): `veewee vbox export myfirstbox`. At the end of the export, a .box file should be present in your directory.

Vagrant: Picking up from where we left with `veewee`, we can now add the box to Vagrant and customize it with shell scripts or much better, with Puppet recipes or Chef cookbooks. First let's add the box file to Vagrant:

vagrant box add 'myfirstbox' '/path/to/box/myfirstbox.box'

Then in a directory of your choice, create the Vagrant "project":

 
vagrant init 'myfirstbox'

This will create a `Vagrantfile` that we will later edit to customize the box. You can boot the machine with `vagrant up` and once it's up , you can ssh to it with `vagrant ssh`.

While `veewee` is used to create a base box with almost no customization (except potentially a chef and/or puppet client), `vagrant` is used to customize the box using the Vagrantfile. For example, to customize the `myfirstbox` box that we just built, set the memory to 2 GB, add a host-only interface with IP 192.168.56.10, use the apache2 Chef cookbook and finally run a `bootstrap.sh` script, we would have the following `Vagrantfile`:

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  # Every Vagrant virtual environment requires a box to build off of.
  config.vm.box = "myfirstbox"
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--memory", 2048]
  end

  #host-only network setup
  config.vm.network "private_network", ip: "192.168.56.10"

  # Chef solo provisioning
  config.vm.provision "chef_solo" do |chef|
     chef.add_recipe "apache2"
  end

  #Test script to install CloudStack
  #config.vm.provision :shell, :path => "bootstrap.sh"
  
end

The cookbook will be in a `cookbooks` directory and the bootstrap script will be in the root directory of this vagrant definition. For more information, check the Vagrant website and experiment.

Vagrant and CloudStack: What is very interesting with Vagrant is that you can use various plugins to deploy machines on public clouds. There is a `vagrant-aws` plugin and of course a `vagrant-cloudstack` plugin. You can get the latest CloudStack plugin from github. You can install it directly with the `vagrant` command line:

vagrant plugin install vagrant-cloudstack

Or if you are building it from source, clone the git repository, build the gem and install it in `vagrant`:

git clone https://github.com/klarna/vagrant-cloudstack.git
gem build vagrant-cloudstack.gemspec
gem install vagrant-cloudstack-0.1.0.gem
vagrant plugin install /Users/sebgoa/Documents/gitforks/vagrant-cloudstack/vagrant-cloudstack-0.1.0.gem

The only drawback that I see is that one would want to upload one's local box (created in the previous section) and use it. Instead one has to create `dummy boxes` that use existing templates available on the public cloud. This is easy to do, but creates a gap between local testing and production deployments. To build a dummy box simply create a `Vagrantfile` file and a `metadata.json` file like so:

$ cat metadata.json 
{
    "provider": "cloudstack"
}
$ cat Vagrantfile 
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.provider :cloudstack do |cs|
    cs.template_id = "a17b40d6-83e4-4f2a-9ef0-dce6af575789"
  end
end

Where the `cs.template_id` is the uuid of a CloudStack template in your cloud. CloudStack users will know how to easily get those uuids with `CloudMonkey`. Then create a `box` file with `tar cvzf cloudstack.box ./metadata.json ./Vagrantfile`. Note that you can add additional CloudStack parameters in this box definition, like the host, path, etc. (something to think about :) ). Then simply add the box in `Vagrant` with:

vagrant box add ./cloudstack.box

You can now create a new `Vagrant` project:

mkdir cloudtest
cd cloudtest
vagrant init

And edit the newly created `Vagrantfile` to use the `cloudstack` box. Add additional parameters like the `ssh` configuration if the box does not use the Vagrant defaults, plus `service_offering_id`, etc. Remember to use your own API and secret keys and change the name of the box to what you created. For example, on exoscale:

# -*- mode: ruby -*-
# vi: set ft=ruby :

# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  # Every Vagrant virtual environment requires a box to build off of.
  config.vm.box = "cloudstack"

  config.vm.provider :cloudstack do |cs, override|
    cs.host = "api.exoscale.ch"
    cs.path = "/compute"
    cs.scheme = "https"
    cs.api_key = "PQogHs2sk_3..."
    cs.secret_key = "...NNRC5NR5cUjEg"
    cs.network_type = "Basic"

    cs.keypair = "exoscale"
    cs.service_offering_id = "71004023-bb72-4a97-b1e9-bc66dfce9470"
    cs.zone_id = "1128bd56-b4d9-4ac6-a7b9-c715b187ce11"

    override.ssh.username = "root" 
    override.ssh.private_key_path = "/path/to/private/key/id_rsa_example"
  end

  # Test bootstrap script
  config.vm.provision :shell, :path => "bootstrap.sh"

end

The machine is brought up with:

vagrant up --provider=cloudstack

You should see output similar to the following:

$ vagrant up --provider=cloudstack
Bringing machine 'default' up with 'cloudstack' provider...
[default] Warning! The Cloudstack provider doesn't support any of the Vagrant
high-level network configurations (`config.vm.network`). They
will be silently ignored.
[default] Launching an instance with the following settings...
[default]  -- Service offering UUID: 71004023-bb72-4a97-b1e9-bc66dfce9470
[default]  -- Template UUID: a17b40d6-83e4-4f2a-9ef0-dce6af575789
[default]  -- Zone UUID: 1128bd56-b4d9-4ac6-a7b9-c715b187ce11
[default]  -- Keypair: exoscale
[default] Waiting for instance to become "ready"...
[default] Waiting for SSH to become available...
[default] Machine is booted and ready for use!
[default] Rsyncing folder: /Users/sebgoa/Documents/exovagrant/ => /vagrant
[default] Running provisioner: shell...
[default] Running: /var/folders/76/sx82k6cd6cxbp7_djngd17f80000gn/T/vagrant-shell20131203-21441-1ipxq9e
Tue Dec  3 14:25:49 CET 2013
This works

Which is a perfect execution of my amazing bootstrap script:

#!/usr/bin/env bash

/bin/date
echo "This works"

You can now start playing with Chef cookbooks, Puppet recipes or SaltStack formulas and automate the configuration of your cloud instances, thanks to Veewee, Vagrant and CloudStack.

Tuesday, November 05, 2013

Fluentd plugin to CloudStack

When it rains, it pours... Here is a quick write up on using Fluentd to log CloudStack events and usage. Fluentd is an open source software to collect events and logs in JSON format. It has hundreds of plugins that allow you to store the logs/events in your favorite data store like AWS S3, MongoDB and even elasticsearch. It is an equivalent of logstash. The source is available on Github but it can also be installed via your favorite package manager (e.g. brew, yum, apt, gem). A CloudStack plugin has been written to listen to CloudStack events and store these events in a chosen storage backend. In this blog I will show you how to store CloudStack logs in MongoDB using Fluentd. Note that the same thing can be done with logstash, just ask @pyr. The documentation is quite straightforward, but here are the basic steps.

You will need a working `fluentd` installed on your machine. Pick your package manager of choice and install `fluentd`, for instance with `gem` we would do:

    sudo gem install fluentd

`fluentd` will now be in your path; you need to create a configuration file and start `fluentd` using this config. For additional options with `fluentd` just enter `fluentd -h`. The `-s` option will create a sample configuration file in the working directory. The `-c` option will start `fluentd` using the specified configuration file. You can then send a test log/event message to the running process with `fluent-cat`.

    $ fluentd -s conf
    $ fluentd -c conf/fluent.conf &
    $ echo '{"json":"message"}' | fluent-cat debug.test

The CloudStack plugin:
CloudStack has a `listEvents` API which does what it says :) it lists events happening within a CloudStack deployment: events such as the start and stop of a virtual machine, creation of security groups, life cycle events of storage elements, snapshots, etc. The `listEvents` API is well documented. Based mostly on this API and the fog ruby library, a CloudStack plugin for `fluentd` was written by Yuichi UEMURA. It is slightly different from using `logstash`, as with `logstash` you can format the log4j logs of the CloudStack management server and collect those directly. Here we rely mostly on the `listEvents` API.
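
If you want to see what these events look like before wiring up fluentd, you can call the API directly, for instance from CloudMonkey (a rough sketch; level and type are optional filters of listEvents):

    $ cloudmonkey
    > list events level=INFO type=VM.CREATE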

You can install it from source:

    git clone https://github.com/u-ichi/fluent-plugin-cloudstack

Then build your own gem and install it with `sudo gem build fluent-plugin-cloudstack.gemspec` and `sudo gem install fluent-plugin-cloudstack-0.0.8.gem `

Or you install the gem directly:

    sudo gem install fluent-plugin-cloudstack

Generate a configuration file with `fluentd -s conf`; you can specify the path to your configuration file. Edit the configuration to define a `source` as being your CloudStack host. For instance, if you are running a development environment locally:

    <source>
      type cloudstack
      host localhost
      apikey $cloudstack_apikey
      secretkey $cloudstack_secretkey

      # optional
      protocol http             # https or http, default https
      path /client/api          # default /client/api
      port 8080                 # default 443
      #interval 300               # min 300, default 300
      ssl false                 # true or false, default true
      domain_id $cloudstack_domain_id
      tag cloudstack
    </source>

There is currently a small bug in the `interval` definition so I commented it out. You also want to define the tag explicitly as `cloudstack`. You can then create a `match` section in the configuration file. To keep it simple at first, we will simply echo the events to `stdout`, therefore just add:

	<match cloudstack.**>
	  type stdout
	</match>

Run `fluentd` with `fluentd -c conf/fluent.conf &`, browse the CloudStack UI, create a VM, create a service offering, just do a few things to generate some events. Once the interval has passed you will see the events being written to `stdout`:

    $ 2013-11-05 12:19:26 +0100 [info]: starting fluentd-0.10.39
    2013-11-05 12:19:26 +0100 [info]: reading config file path="conf/fluent.conf"
    2013-11-05 12:19:26 +0100 [info]: using configuration file: <ROOT>
      <source>
        type forward
      </source>
      <source>
        type cloudstack
        host localhost
        apikey 6QN8jOzEfhR7Fua69vk5ocDo_tfg8qqkT7-2w7nnTNsSRyPXyvRRAy23683qcrflgliHed0zA3m0SO4W9kh2LQ
        secretkey HZiu9vhPAxA8xi8jpGWMWb9q9f5OL1ojW43Fd7zzQIjrcrMLoYekeP1zT9d-1B3DDMMpScHSR9gAnnG45ewwUQ
        protocol http
        path /client/api
        port 8080
        interval 3
        ssl false
        domain_id a9e4b8f0-3fd5-11e3-9df7-78ca8b5a2197
        tag cloudstack
      </source>
      <match debug.**>
        type stdout
      </match>
      <match cloudstack.**>
        type stdout
      </match>
    </ROOT>
    2013-11-05 12:19:26 +0100 [info]: adding source type="forward"
    2013-11-05 12:19:26 +0100 [info]: adding source type="cloudstack"
    2013-11-05 12:19:27 +0100 [info]: adding match pattern="debug.**" type="stdout"
    2013-11-05 12:19:27 +0100 [info]: adding match pattern="cloudstack.**" type="stdout"
    2013-11-05 12:19:27 +0100 [info]: listening fluent socket on 0.0.0.0:24224
    2013-11-05 12:19:27 +0100 [info]: listening cloudstack api on localhost
    2013-11-05 12:19:30 +0100 cloudstack.usages: {"events_flow":0}
    2013-11-05 12:19:30 +0100 cloudstack.usages: {"vm_sum":1,"memory_sum":536870912,"cpu_sum":1,"root_volume_sum":1400,"data_volume_sum":0,"Small Instance":1}
    2013-11-05 12:19:33 +0100 cloudstack.usages: {"events_flow":0}
    2013-11-05 12:19:33 +0100 cloudstack.usages: {"vm_sum":1,"memory_sum":536870912,"cpu_sum":1,"root_volume_sum":1400,"data_volume_sum":0,"Small Instance":1}
    2013-11-05 12:19:36 +0100 cloudstack.usages: {"events_flow":0}
    2013-11-05 12:19:36 +0100 cloudstack.usages: {"vm_sum":1,"memory_sum":536870912,"cpu_sum":1,"root_volume_sum":1400,"data_volume_sum":0,"Small Instance":1}
    2013-11-05 12:19:39 +0100 cloudstack.usages: {"events_flow":0}
    ...
    2013-11-05 12:19:53 +0100 cloudstack.event: {"id":"b5051963-33e5-4f44-83bc-7b78763dcd24","username":"admin","type":"VM.DESTROY","level":"INFO","description":"Successfully completed destroying Vm. Vm Id: 17","account":"admin","domainid":"a9e4b8f0-3fd5-11e3-9df7-78ca8b5a2197","domain":"ROOT","created":"2013-11-05T12:19:53+0100","state":"Completed","parentid":"d0d47009-050e-4d94-97d9-a3ade1c80ee3"}
    2013-11-05 12:19:53 +0100 cloudstack.event: {"id":"39f8ff37-515c-49dd-88d3-eeb77d556223","username":"admin","type":"VM.DESTROY","level":"INFO","description":"destroying Vm. Vm Id: 17","account":"admin","domainid":"a9e4b8f0-3fd5-11e3-9df7-78ca8b5a2197","domain":"ROOT","created":"2013-11-05T12:19:53+0100","state":"Started","parentid":"d0d47009-050e-4d94-97d9-a3ade1c80ee3"}
    2013-11-05 12:19:53 +0100 cloudstack.event: {"id":"d0d47009-050e-4d94-97d9-a3ade1c80ee3","username":"admin","type":"VM.DESTROY","level":"INFO","description":"destroying vm: 17","account":"admin","domainid":"a9e4b8f0-3fd5-11e3-9df7-78ca8b5a2197","domain":"ROOT","created":"2013-11-05T12:19:53+0100","state":"Scheduled"}
    2013-11-05 12:19:55 +0100 cloudstack.usages: {"events_flow":3}
    2013-11-05 12:19:55 +0100 cloudstack.usages: {"vm_sum":1,"memory_sum":536870912,"cpu_sum":1,"root_volume_sum":1400,"data_volume_sum":0,"Small Instance":1}
    ...
    2013-11-05 12:20:18 +0100 cloudstack.event: {"id":"11136a76-1de0-4907-b31d-2557bc093802","username":"admin","type":"SERVICE.OFFERING.CREATE","level":"INFO","description":"Successfully completed creating service offering. Service offering id=13","account":"system","domainid":"a9e4b8f0-3fd5-11e3-9df7-78ca8b5a2197","domain":"ROOT","created":"2013-11-05T12:20:18+0100","state":"Completed"}
    2013-11-05 12:20:19 +0100 cloudstack.usages: {"events_flow":1}
    2013-11-05 12:20:19 +0100 cloudstack.usages: {"vm_sum":1,"memory_sum":536870912,"cpu_sum":1,"root_volume_sum":1400,"data_volume_sum":0,"Small Instance":1}

I cut some of the output for brevity. Note that I have the interval set to `3` because I did not want to wait 300 seconds; to do this I installed from source and patched the plugin, it should be fixed upstream soon. You might have a different endpoint and of course different keys. And don't worry about me sharing that `secret_key`: I am using a simulator and that key is already gone.

Getting the events and usage information on stdout is interesting, but the kicker comes from storing the data in a database or a search index. In this section we show how to get closer to reality and use MongoDB to store the data. MongoDB is an open source document database which is schemaless and stores documents in JSON format (BSON actually). Installation and query syntax of MongoDB are beyond the scope of this post. MongoDB clusters can be set up with replication and sharding; in this section we use MongoDB on a single host with no sharding or replication. To use MongoDB as a storage backend for the events, we first need to install `mongodb`. On a single OS X node this is as simple as `sudo port install mongodb`. For other OSes use the appropriate package manager. You can then start mongodb with `sudo mongod --dbpath=/path/to/your/databases`. Create a `fluentd` database and a `fluentd` user with read/write access to it. In the mongo shell do:

    $ sudo mongo
    > use fluentd
    > db.addUser({user: "fluentd", pwd: "foobar", roles: ["readWrite", "dbAdmin"]})

We then need to install the `fluent-plugin-mongo` gem. Still using `gem`, this is done like so:

    $ sudo gem install fluent-plugin-mongo

The complete documentation also explains how to modify the configuration of `fluentd` to use this backend. Previously we used `stdout` as the output backend; to use `mongodb` we just need to write a different `match` section like so:

	# Single MongoDB
	<match cloudstack.**>
	  type mongo
	  host fluentd
	  port 27017
	  database fluentd
	  collection test

	  # for capped collection
	  capped
	  capped_size 1024m

	  # authentication
	  user fluentd
	  password foobar

	  # flush
	  flush_interval 10s
	</match>

Note that you cannot have multiple `match` sections for the same tag pattern; if you want to keep the `stdout` output as well, use the `copy` output plugin as sketched below.
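
Here is a rough sketch (untested) of such a `copy` section, fanning the CloudStack events out to both `stdout` and MongoDB (the mongo host is assumed to be localhost here):

    <match cloudstack.**>
      type copy
      <store>
        type stdout
      </store>
      <store>
        type mongo
        host localhost
        port 27017
        database fluentd
        collection test
        user fluentd
        password foobar
      </store>
    </match>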

To view the events/usages in Mongo, simply start a mongo shell with `mongo -u fluentd -p foobar fluentd` and list the collections. You will see the `test` collection:

    $ mongo -u fluentd -p foobar fluentd
    MongoDB shell version: 2.4.7
    connecting to: fluentd
    Server has startup warnings: 
    Fri Nov  1 13:11:44.855 [initandlisten] 
    Fri Nov  1 13:11:44.855 [initandlisten] ** WARNING: soft rlimits too low. Number of files is 256, should be at least 1000
    > show collections
    system.indexes
    system.users
    test

A couple of MongoDB commands will get you rolling: `db.getCollection`, `count()` and `findOne()`:

    > coll=db.getCollection('test')
    fluentd.test
    > coll.count()
    181
    > coll.findOne()
    {
    	"_id" : ObjectId("5278d9822675c98317000001"),
	    "events_flow" : 0,
	    "time" : ISODate("2013-11-05T11:41:47Z")
    }

The `find()` call returns all entries in the collection.

    > coll.find()
    { "_id" : ObjectId("5278d9822675c98317000001"), "events_flow" : 0, "time" : ISODate("2013-11-05T11:41:47Z") }
    { "_id" : ObjectId("5278d9822675c98317000002"), "vm_sum" : 0, "memory_sum" : 0, "cpu_sum" : 0, "root_volume_sum" : 1500, "data_volume_sum" : 0, "Small Instance" : 1, "time" : ISODate("2013-11-05T11:41:47Z") }
    { "_id" : ObjectId("5278d98d2675c98317000009"), "events_flow" : 0, "time" : ISODate("2013-11-05T11:41:59Z") }
    { "_id" : ObjectId("5278d98d2675c9831700000a"), "vm_sum" : 0, "memory_sum" : 0, "cpu_sum" : 0, "root_volume_sum" : 1500, "data_volume_sum" : 0, "Small Instance" : 1, "time" : ISODate("2013-11-05T11:41:59Z") }
    { "_id" : ObjectId("5278d98d2675c9831700000b"), "id" : "1452c56a-a1e4-43d2-8916-f83a77155a2f", "username" : "admin", "type" : "VM.CREATE", "level" : "INFO", "description" : "Successfully completed starting Vm. Vm Id: 19", "account" : "admin", "domainid" : "a9e4b8f0-3fd5-11e3-9df7-78ca8b5a2197", "domain" : "ROOT", "created" : "2013-11-05T12:42:01+0100", "state" : "Completed", "parentid" : "df68486e-c6a8-4007-9996-d5c9a4522649", "time" : ISODate("2013-11-05T11:42:01Z") }
    { "_id" : ObjectId("5278d98d2675c9831700000c"), "id" : "901f9408-ae05-424f-92cd-5693733de7d6", "username" : "admin", "type" : "VM.CREATE", "level" : "INFO", "description" : "starting Vm. Vm Id: 19", "account" : "admin", "domainid" : "a9e4b8f0-3fd5-11e3-9df7-78ca8b5a2197", "domain" : "ROOT", "created" : "2013-11-05T12:42:00+0100", "state" : "Scheduled", "parentid" : "df68486e-c6a8-4007-9996-d5c9a4522649", "time" : ISODate("2013-11-05T11:42:00Z") }
    { "_id" : ObjectId("5278d98d2675c9831700000d"), "id" : "df68486e-c6a8-4007-9996-d5c9a4522649", "username" : "admin", "type" : "VM.CREATE", "level" : "INFO", "description" : "Successfully created entity for deploying Vm. Vm Id: 19", "account" : "admin", "domainid" : "a9e4b8f0-3fd5-11e3-9df7-78ca8b5a2197", "domain" : "ROOT", "created" : "2013-11-05T12:42:00+0100", "state" : "Created", "time" : ISODate("2013-11-05T11:42:00Z") }
    { "_id" : ObjectId("5278d98d2675c9831700000e"), "id" : "924ba8b9-a9f2-4274-8bbd-c27947d2c246", "username" : "admin", "type" : "VM.CREATE", "level" : "INFO", "description" : "starting Vm. Vm Id: 19", "account" : "admin", "domainid" : "a9e4b8f0-3fd5-11e3-9df7-78ca8b5a2197", "domain" : "ROOT", "created" : "2013-11-05T12:42:00+0100", "state" : "Started", "parentid" : "df68486e-c6a8-4007-9996-d5c9a4522649", "time" : ISODate("2013-11-05T11:42:00Z") }
    { "_id" : ObjectId("5278d98d2675c9831700000f"), "events_flow" : 4, "time" : ISODate("2013-11-05T11:42:02Z") } 
    { "_id" : ObjectId("5278d98d2675c98317000010"), "vm_sum" : 1, "memory_sum" : 536870912, "cpu_sum" : 1, "root_volume_sum" : 1600, "data_volume_sum" : 0, "Small Instance" : 1, "time" : ISODate("2013-11-05T11:42:02Z") }
    Type "it" for more

We leave it to you to learn the MongoDB query syntax and the great aggregation framework, have fun. Of course you can get the data into elasticsearch as well :)
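
If you want a head start on the aggregation framework, here is a rough sketch (untested) of a pipeline that counts the stored events per event type, assuming the collection is still called `test`:

    > db.test.aggregate([
        { $match: { type: { $exists: true } } },
        { $group: { _id: "$type", count: { $sum: 1 } } },
        { $sort:  { count: -1 } }
      ])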

OCCI interface to CloudStack

CloudStack has its own API. Cloud wrappers like libcloud and jclouds work well with this native API, but CloudStack does not expose any standard API like OCCI or CIMI. We (Isaac Chiang really, I just tested and pointed him in the right direction) started working on a CloudStack backend for rOCCI using our CloudStack ruby gem. The choice of rOCCI was made due to the existing OpenNebula backend and the adoption of OCCI in the European Grid Initiative federated cloud testbed.

Let's get started by installing the rOCCI server. This work has not yet been merged upstream, so you will need to work from Isaac Chiang's fork:

    git clone https://github.com/isaacchiang/rOCCI-server.git
    bundle install
    cd etc/backend
    cp cloudstack/cloudstack.json default.json

Edit the default.json file to contain the information about your CloudStack cloud (e.g. apikey, secretkey, endpoint). Start the rOCCI server:

    bundle exec passenger start

The server should be running on http://0.0.0.0:3000. Run the tests:

    bundle exec rspec

This was tested with the CloudStack simulator and a basic zone configuration; help us test it in production clouds.

You can also try an OCCI client. Install the rOCCI client from Github:

    git clone https://github.com/gwdg/rOCCI-cli.git

    cd rOCCI-cli
    gem install bundler
    bundle install
    bundle exec rake test
    rake install

You will then be able to use the OCCI client:

    occi --help

Test it against the server that you started previously. You will need a running CloudStack cloud, either a production one or a dev instance using DevCloud. The credentials and the endpoint to this cloud will have been entered in the `default.json` file that you created in the previous section. Try a couple of OCCI client commands:

    $ occi --endpoint http://0.0.0.0:3000/ --action list --resource os_tpl

    Os_tpl locations:
     os_tpl#6673855d-ce9b-4997-8613-6830de037a8f

    $ occi --endpoint http://0.0.0.0:3000/ --action list --resource resource_tpl

    Resource_tpl locations:
     resource_tpl##08ba0343-bd39-4bf0-9aab-4953694ae2b4
     resource_tpl##f78769bd-95ea-4139-ad9b-9dfc1c5cb673
     resource_tpl##0fd364a9-7e33-4375-9e10-bb861f7c6ee7

You will recognize the `uuid` from the templates and service offerings that you have created in CloudStack. To start an instance:

    $ occi --endpoint http://0.0.0.0:3000/ --action create --resource compute \
           --mixin os_tpl#6673855d-ce9b-4997-8613-6830de037a8f \
           --mixin resource_tpl#08ba0343-bd39-4bf0-9aab-4953694ae2b4 \
           --resource-title foobar

A handle on the resource created will be returned. That's it!

We will keep on improving this driver to provide a production quality OCCI interface to users who want to use a standard. In all fairness we will also work on a CIMI implementation. Hopefully some of the clouds in the EGI federated cloud will pick CloudStack and help us improve this OCCI interface. In CloudStack we aim to provide the interfaces that the users want and keep them up to date and of production quality so that users can depend on it.

Thursday, October 17, 2013

Why I will go to CCC13 in Amsterdam?

Aside from the fact that I work full-time on Apache CloudStack, that I am on the organizing committee and that my boss would kill me if I did not go to the CloudStack Collaboration Conference, there are many great reasons why I want to go as an open source enthusiast. Here is why:

It's Amsterdam and we are going to have a blast (the city of Amsterdam is even sponsoring the event). The venue -Beurs Van Berlage- is terrific; this is the same venue where the Hadoop summit is held and where the AWS Benelux Summit was a couple of weeks ago. We are going to have a 24/7 Developer room (thanks to CloudSoft) where we can meet to hack on CloudStack and its ecosystem, three parallel tracks in other rooms and great evening events. The event is made possible by the amazing local support from the team at Schuberg Philis, a company that has devops in its veins and organized DevOps Days Amsterdam. I am not being very subtle in acknowledging our sponsors here, but hey, without them this would not be possible.

On the first day (November 20th) is the hackathon sponsored by exoscale. In parallel to the hackathon, new users of CloudStack will be able to attend a full day bootcamp run by the super competent guys from Shapeblue; they also play guitar and drink beers, so make sure to hang out with them :). Just as cool, the CloudStack community recognizes that building a cloud takes many components, so we will have a jenkins workshop and an elasticsearch workshop. I am a big fan of elasticsearch, not only for keeping your infrastructure logs but also for other types of data. I actually store all CloudStack emails in an elasticsearch cluster. Jenkins of course is at the heart of everyone's continuous integration systems these days. Seeing those two workshops, it will be no surprise to see a DevOps track the next two days.

Kicking off the second day -the first day of talks- we will have a keynote by Patrick Debois, the jedi master of DevOps. We will then break up into a user track, a developer track, a commercial track and, for this day only, a devops track with a 'culture' flavor. The hard work will begin: choosing which talk to attend. I am not going to go through every talk; we received a lot of great submissions and choosing was hard. New CloudStack users or people looking into using CloudStack will gain a lot from the case studies presented in the user track, while the developers will get a deep dive into the advanced networking features of CloudStack, including SDN support right off the bat. In the afternoon, the case studies will continue in the user track, including a talk from NTT about how they built an AWS compatible cloud. I will have to head to the developer track for a session on 'interfaces', with a talk on jclouds, a new GCE interface that I worked on, and my own talk on Apache libcloud, for which I worked a lot on the CloudStack driver. The DevOps track will have an entertaining talk by Michael Ducy from Opscode, some real world experiences by John Turner and Noel King from Paddy Power, and the VP of engineering for Citrix CloudPlatform will lead an interactive session on how to best work with the open source community of Apache CloudStack.

After recovering from the night's events, we will head into the second day with another entertaining keynote by John Willis. Here the choice will be hard between the storage session in the commercial track and the 'Future of CloudStack' session in the developer track. With talks from NetApp and SolidFire, who have each developed a plugin in CloudStack, plus our own Wido Den Hollander (PMC member) who wrote the Ceph integration, the storage session will rock, but the 'Future of CloudStack' session will be key for developers, talking about frameworks, integration testing, system VMs... After lunch the user track will feature several intro to networking talks. Networking is the most difficult concept to grasp in clouds (IMHO). The storage session will continue with a talk by Basho on RiakCS (also integrated in CloudStack) and a panel. The dev track will be dedicated to discussions on PaaS, not to be missed if you ask me, as PaaS is the next step in clouds. To wrap things up, I will have to decide between a session on metering/billing, a discussion on hypervisor choice and support, and a presentation on the CloudStack community in Japan, after Ruv Cohen talking about trading cloud commodities.

The agenda is loaded and ready to fire. It will be tough to decide which sessions to attend, but you will come out refreshed and energized, with lots of new ideas to evolve your IT infrastructure, so one word: Register.

And of course many thanks to our sponsors: Citrix, Schuberg Philis, Juniper, Sungard, Shapeblue, NetApp, cloudSoft, Nexenta, iKoula, leaseweb, solidfire, greenqloud, atom86, apalia, elasticsearch, 2source4, iamsterdam, cloudbees and 42on

Tuesday, October 01, 2013

A look at RIAK-CS from BASHO

Playing with Basho Riak CS Object Store

CloudStack deals with the compute side of an IaaS; the storage side, which for most of us these days consists of a scalable, fault tolerant object store, is left to other software. Ceph, led by Inktank, and RiakCS from Basho are the two most talked about object stores these days. In this post we look at RiakCS and take it for a quick whirl. CloudStack integrates with RiakCS for secondary storage and together they can offer an EC2 and a true S3 interface, backed by a scalable object store. So here it is.

While RiakCS (Cloud Storage) can be seen as an S3 backend implementation, it is based on Riak. Riak is a highly available distributed NoSQL database. The use of a consistent hashing algorithm allows Riak to re-balance the data when nodes disappear (e.g. fail) and when nodes appear (e.g. increased capacity); it also manages replication of data with an eventual consistency principle, typical of large scale distributed storage systems which favor availability over consistency.

To get a functioning RiakCS storage we need Riak, RiakCS and Stanchion. Stanchion is an interface that serializes http requests made to RiakCS.

A taste of Riak

To get started, let's play with Riak and build a cluster on our local machine. Basho has some great documentation; the toughest thing will be to install Erlang (and by tough I mean a 2 minute deal), but again the docs are very helpful and give step by step instructions for almost all OSes.

There is no need for me to re-create step by step instructions since the docs are so great, but the gist is that with the quickstart guide we can create a Riak cluster on `localhost`. We are going to start five Riak nodes (we could start more) and join them into a cluster. For each node, this is as simple as:

    bin/riak start
    bin/riak-admin cluster join dev1@127.0.0.1

Where `dev1` was the first Riak node started. Creating this cluster will re-balance the ring:

    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid     100.0%     20.3%    'dev1@127.0.0.1'
    valid       0.0%     20.3%    'dev2@127.0.0.1'
    valid       0.0%     20.3%    'dev3@127.0.0.1'
    valid       0.0%     20.3%    'dev4@127.0.0.1'
    valid       0.0%     18.8%    'dev5@127.0.0.1'

The `riak-admin` command is a nice CLI to manage the cluster. We can check the membership of the cluster we just created; after some time the ring will have re-balanced to the expected state.

    dev1/bin/riak-admin member-status
    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid      62.5%     20.3%    'dev1@127.0.0.1'
    valid       9.4%     20.3%    'dev2@127.0.0.1'
    valid       9.4%     20.3%    'dev3@127.0.0.1'
    valid       9.4%     20.3%    'dev4@127.0.0.1'
    valid       9.4%     18.8%    'dev5@127.0.0.1'
    -------------------------------------------------------------------------------
    Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
   
    dev1/bin/riak-admin member-status
    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid      20.3%      --      'dev1@127.0.0.1'
    valid      20.3%      --      'dev2@127.0.0.1'
    valid      20.3%      --      'dev3@127.0.0.1'
    valid      20.3%      --      'dev4@127.0.0.1'
    valid      18.8%      --      'dev5@127.0.0.1'
    -------------------------------------------------------------------------------

You can then test your cluster by putting an image in it as explained in the docs and retrieving it in a browser (i.e. an HTTP GET):

    curl -XPUT http://127.0.0.1:10018/riak/images/1.jpg \
         -H "Content-type: image/jpeg" \
         --data-binary @image_name_.jpg

Open the browser to `http://127.0.0.1:10018/riak/images/1.jpg`. As easy as 1..2..3.

Installing everything on Ubuntu 12.04

To move forward and build a complete S3 compatible object store, let's set everything up on an Ubuntu 12.04 machine. Back to installing `riak`: get the repo keys and set up a `basho.list` repository:

    curl http://apt.basho.com/gpg/basho.apt.key | sudo apt-key add -
    bash -c "echo deb http://apt.basho.com $(lsb_release -sc) main > /etc/apt/sources.list.d/basho.list"
    apt-get update

And grab `riak`, `riak-cs` and `stanchion`. I am not sure why, but their otherwise great docs make you download the .debs separately and use `dpkg`.

    apt-get install riak riak-cs stanchion
 
Check that the binaries are in your path with `which riak`, `which riak-cs` and `which stanchion`; you should find everything in `/usr/sbin`. All configuration will be in `/etc/riak`, `/etc/riak-cs` and `/etc/stanchion`; inspect especially the `app.config` files, which we are going to modify before starting everything. Note that all binaries have a nice usage description, which includes a console, a ping method and a restart among others:

    Usage: riak {start | stop| restart | reboot | ping | console | attach | 
                        attach-direct | ertspath | chkconfig | escript | version | 
                        getpid | top [-interval N] [-sort reductions|memory|msg_q] [-lines N] }

Configuration

Before starting anything we are going to configure every component, which means editing the `app.config` files in each respective directory. For `riak-cs` I only made sure to set `{anonymous_user_creation, true}`. I did nothing to configure `stanchion`, as I used the default ports and ran everything on `localhost` without `ssl`. Just make sure that you are not running any other application on port `8080`, as `riak-cs` will use this port by default. For configuring `riak` see the documentation; it sets a different backend than what we used in the `tasting` phase :) With all this configuration done you should be able to start all three components:

    riak start
    riak-cs start
    stanchion start

You can `ping` every component and check the console with `riak ping`, `riak-cs ping` and `stanchion ping`; I let you figure out the console access. Create an admin user for `riak-cs`:

    curl -H 'Content-Type: application/json' -X POST http://localhost:8080/riak-cs/user \
         --data '{"email":"foobar@example.com", "name":"admin user"}'
 
If this returns successfully, it is a good indication that your setup is working properly. In the response we recognize the API and secret keys:

    {"email":"foobar@example.com",
     "display_name":"foobar",
     "name":"admin user",
     "key_id":"KVTTBDQSQ1-DY83YQYID",
     "key_secret":"2mNGCBRoqjab1guiI3rtQmV3j2NNVFyXdUAR3A==",
     "id":"1f8c3a88c1b58d4b4369c1bd155c9cb895589d24a5674be789f02d3b94b22e7c",
     "status":"enabled"}
 
Let's take those and put them in our `riak-cs` configuration file; there are `admin_key` and `admin_secret` variables to set. Then restart with `riak-cs restart`. Don't forget to also add them in the `stanchion` configuration file `/etc/stanchion/app.config` and restart it with `stanchion restart`.
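
In `/etc/riak-cs/app.config` the relevant lines would look roughly like this (a sketch, using the keys returned above):

    {admin_key, "KVTTBDQSQ1-DY83YQYID"},
    {admin_secret, "2mNGCBRoqjab1guiI3rtQmV3j2NNVFyXdUAR3A=="},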

Using our new Cloud Storage with Boto

Since RiakCS is an S3 compatible cloud storage solution, we should be able to use an S3 client like Python boto to create buckets and store data. Let's try. You will need boto of course (`apt-get install python-boto`), then open an interactive shell with `python`. Import the modules and create a connection to `riak-cs`:

    >>> from boto.s3.key import Key
    >>> from boto.s3.connection import S3Connection
    >>> from boto.s3.connection import OrdinaryCallingFormat
    >>> apikey='KVTTBDQSQ1-DY83YQYID'
    >>> secretkey='2mNGCBRoqjab1guiI3rtQmV3j2NNVFyXdUAR3A=='
    >>> cf=OrdinaryCallingFormat()
    >>> conn=S3Connection(aws_access_key_id=apikey,aws_secret_access_key=secretkey,
                          is_secure=False,host='localhost',port=8080,calling_format=cf)
 
Now you can list the buckets, which will return an empty list at first. Then create a bucket and store content in it under various keys:
    >>> conn.get_all_buckets()
    []
    >>> bucket=conn.create_bucket('riakbucket')
    >>> k=Key(bucket)
    >>> k.key='firstkey'
    >>> k.set_contents_from_string('Object from first key')
    >>> k.key='secondkey'
    >>> k.set_contents_from_string('Object from second key')
    >>> b=conn.get_all_buckets()[0]
    >>> k=Key(b)
    >>> k.key='secondkey'
    >>> k.get_contents_as_string()
    'Object from second key'
    >>> k.key='firstkey'
    >>> k.get_contents_as_string()
    'Object from first key'
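
Boto can also produce pre-signed URLs for objects, which is handy if you want to hand out temporary access without sharing your keys. A quick sketch continuing the same Python session; the expiry is in seconds and the resulting URL shown is only illustrative:

    >>> k.key='firstkey'
    >>> url=k.generate_url(expires_in=3600, query_auth=True, force_http=True)
    >>> url
    'http://localhost:8080/riakbucket/firstkey?AWSAccessKeyId=...&Expires=...&Signature=...'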

And that's it: an S3-compatible object store backed by a NoSQL distributed database that uses consistent hashing, all of it in Erlang. Automate the installation with a Chef recipe, hook it up to your EC2-compatible CloudStack cloud, use it as secondary storage to hold templates or make it a public-facing offering, and you have the second leg of the Cloud: storage. Sweet... In the next post I will show you how to use it with CloudStack.

Tuesday, September 03, 2013

CloudStack Google Summer of Code projects

Google Summer of Code is entering the final stretch, with pencils down on Sept 16th and final evaluations on Sept 27th. Of the five projects CloudStack had this summer, one failed at mid-term and one led to committer status a couple of weeks ago. That's 20% failure and 20% outstanding results, on par with GSoC-wide statistics I believe.

The LDAP integration has been the most productive project. Ian Duffy, a 20-year-old from Dublin, did an outstanding job, developing his new feature in a feature branch, building a Jenkins pipeline to test everything and submitting a merge request to master a couple of weeks ago. With 90% unit-test coverage, static code analysis with Sonar in his Jenkins pipeline and automatic publishing of RPMs to a local yum repo, Ian exceeded expectations. His code has even already been backported to the 4.1.1 release in the CloudSand distribution of CloudStack.

The SDN extension project was about taking the native GRE controller in CloudStack and extending it to support XCP and KVM. Nguyen from Vietnam has done an excellent job, quickly adding support for XCP thanks to his expertise with Xen. He is now putting the final touches on KVM support and building L3 services with OpenDaylight. The entire GRE controller was re-factored to be a plugin similar to the Nicira NVP, Midonet and BigSwitch BVS plugins. While native to CloudStack, this controller brings another SDN solution to the project. I expect to see his merge request before pencils down for what will be an extremely valuable project.

While the CloudStack UI is great, it was actually written as a demonstration of how the CloudStack API could be used to build a user-facing portal. With the "new UI" project, Shiva Teja from India used Bootstrap and Angular to create a new UI. Originally the project suggested using Backbone, but after feedback from the community Shiva switched to Angular. Shiva's efforts are to be commended as he truly worked on his own, with inconsistent network connectivity and no local mentoring. Shiva is a bachelor student and had to learn Bootstrap, Angular and also Flask on his own. It must have paid off, since he is interviewing with Amazon and Google for internships next summer. His code, being independent of the CloudStack codebase, has been committed to master in our tools directory. This creates a solid framework for others to build on and create their own CloudStack UI.

Perhaps the most research-oriented project has been the one from Meng Han from Florida. This was no standard coding project, as it required not only learning new technologies (aside from CloudStack) but also investigating the Amazon EMR API. Meng had to implement EMR in CloudStack using Apache Whirr. Whirr is a java library for provisioning virtual machines on cloud providers. Whirr uses Apache jclouds and can interact with most cloud providers out there. Meng developed a new set of CloudStack APIs to launch hadoop clusters on-demand. At the start she had to learn CloudStack and install it, then learn the Whirr library and subsequently create a new API in CloudStack which would use Whirr to coordinate multi-node deployments. Meng's code is working but still falls a bit short of our goal of having an AWS EMR interface. This is partly my fault, as this project could have used more mentoring. In any case, the work will go on and I expect to see an EMR implementation in CloudStack in the coming months.

All students faced the same challenge, not a code writing challenge but the OSS challenge, and specifically learning the Apache Way. Apache is about consensus and public discussions on the mailing list. With several hundred participants every month and very active discussions, the sheer amount of email traffic can be intimidating. Sharing issues and asking for help on a public mailing list is still a bit frightening. IRC, intense emailing, JIRA and git are basic tools used in all Apache projects, but seldom used in academic settings. Learning these development tools and participating in a project with over a million lines of code was the toughest challenge for the students and the goal of GSoC. I am glad we got five students to join CloudStack this summer and tackle these challenges; if anything it is a terrific experience that will benefit their own academic endeavors and later their entire careers. Great job Ian, Meng, Nguyen, Shiva and Dharmesh, we are not done yet but I wish you all the best.

Tuesday, August 27, 2013

About those Cloud APIs....

There has been a lot of discussion lately within the OpenStack community on the need for an AWS API interface to OpenStack Nova. I followed the discussion from afar via a few tweets, but I am of the opinion that any IaaS solution does need to expose an AWS interface. AWS is the leader in Cloud and has been since 2006 (yes, that's seven years). Users are accustomed to it, and the AWS API is the de facto standard.

When Eucalyptus started, its main goal was to become an AWS clone, and in 2012 it signed an agreement with Amazon to offer seamless AWS support in Eucalyptus. OpenNebula has almost always offered an AWS bridge and CloudStack has too, even though, in full disclosure, the interface was broken in the Apache CloudStack 4.1 release. Thankfully the AWS interface is now fixed in the 4.1.2 release and will also be in the upcoming 4.2 release. To avoid breaking this interface again we are developing a jenkins pipeline which will test it using the Eucalyptus testing suite.

OpenNebula recently ran a survey to determine where to best put its efforts in API development. The results were clear, with 47% of respondents asking for better AWS compatibility. There are of course official standards being developed by standards organizations, most notably OCCI from the OGF and CIMI from the DMTF. The OpenNebula survey seems to indicate a stronger demand for OCCI than CIMI, but IMHO this is due to historical reasons: OpenNebula's early efforts in being one of the first OCCI implementations, and its user base, especially within projects like HelixNebula.

CIMI was promising and probably still is, but it will most likely face an uphill battle since Red Hat announced it is scaling back its support of Apache Deltacloud. I recently heard about a new CIMI implementation project for StratusLab from some of my friends at SixSq; it is interesting and fun because it is written in Clojure, and I hope to see it used with Clostack to provide a CIMI interface to CloudStack. We may be a couple of weeks out :)

While AWS is the de-facto standard, I want to make sure that CloudStack offers choices to its users. If someone wants to use OCCI, CIMI, AWS or the native CloudStack API, they should be able to. I will be at the CloudPlugfest Interoperability week in Madrid Sept 18-20, where I hope to demonstrate a brand new OCCI interface to CloudStack using rOCCI and the CloudStack ruby gem. A CloudStack contributor from Taiwan has been working on it.

The main issue with all these "standard" interfaces is that they will never give you the complete API of a given IaaS implementation. By nature they provide the lowest common denominator. That roughly means that the user-facing APIs could be standardized, but the administrator APIs will always remain hidden and non-standard. In CloudStack for instance, there are over 300 API calls. While we can expose a compatible interface, it will always only cover a subset of the overall API. It also raises the question of all the other AWS services: EMR, Beanstalk, CloudFormation... Standardizing on those will be extremely difficult if not impossible.

So yes, we should expose an AWS compatible interface, but we should also have OCCI, CIMI and of course our native API. Making those bridges is not hard; what's hard is the implementation behind them.

All of this would leave us with Google Compute Engine (GCE), and I should be able to bring back some good news by the end of September. Stay tuned !!!

Friday, July 12, 2013

And yes Elasticsearch cluster with Whirr and CloudStack as well

In my previous post I showed how to deploy hadoop clusters on-demand using Apache Whirr and CloudStack. Whirr can do much more than hadoop: Cassandra, Ganglia, Solr, Zookeeper etc...and of course Elasticsearch.

This post is really a little wink at a CloudStack PMC member and also the start of an investigation into whether ES would be a better choice than MongoDB for all my email and log analysis.

Let's get to it. In the Whirr source, find the elasticsearch.properties file under /recipes and modify it to your liking and for your cloud:

For this test I am using exoscale again, but any CloudStack cloud like ikoula, pcextreme or leaseweb will do (some European chauvinism right there :) ). In a basic zone, specify the ssh keypair that whirr will use to bootstrap ES:

whirr.cloudstack-keypair=exoscale

Set the number of machines in the cluster

# Change the number of machines in the cluster here
whirr.instance-templates=2 elasticsearch

Set the instance type of each cluster instance (e.g. large, extra-large, etc.). I have not tried to pass them by name; I just use the uuid of the service offering from CloudStack.

whirr.hardware-id=b6cd1ff5-3a2f-4e9d-a4d1-8988c1191fe8

And of course define the endpoint

whirr.provider=cloudstack
whirr.endpoint=https://api.exoscale.ch/compute

And define the template that you want to use (e.g Ubuntu, CentOS etc)

whirr.image-id=1d16c78d-268f-47d0-be0c-b80d31e765d2

Finally, define the ES tarball you want to use. Whirr has not updated this in a long time, so the default is still set at 0.15. Remember to change it.

# You can specify the version by setting the tarball url
whirr.elasticsearch.tarball.url=http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.2.tar.gz
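
Putting those pieces together, the cloud-specific part of my elasticsearch.properties ends up looking roughly like this. The cluster name is arbitrary, the uuids and keypair are specific to my exoscale account, and the API keys themselves can go in ~/.whirr/credentials (or as whirr.identity and whirr.credential properties):

# a consolidated sketch; the values are the ones used in this post
whirr.cluster-name=elasticsearch
whirr.provider=cloudstack
whirr.endpoint=https://api.exoscale.ch/compute
whirr.cloudstack-keypair=exoscale
whirr.instance-templates=2 elasticsearch
whirr.hardware-id=b6cd1ff5-3a2f-4e9d-a4d1-8988c1191fe8
whirr.image-id=1d16c78d-268f-47d0-be0c-b80d31e765d2
whirr.elasticsearch.tarball.url=http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.2.tar.gz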

Then launch the cluster

whirr launch-cluster --config elasticsearch.properties

When whirr is done launching and bootstrapping the instances you will get something like:

You can log into instances using the following ssh commands:
[elasticsearch]: ssh -i /Users/toto/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no toto@185.19.28.90
[elasticsearch]: ssh -i /Users/toto/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no toto@185.19.28.92
To destroy cluster, run 'whirr destroy-cluster' with the same options used to launch it.

Don't bother with those IPs: the cluster is already dead as you read this, and the security groups are set so that I am the only one who can access them. That said, you now have a working elasticsearch cluster in the cloud and you can hit it with the API:

$ curl -XGET 'http://185.19.28.90:9200/_cluster/nodes?pretty=true'
{
  "ok" : true,
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "njCzrXYaTnKxtqKsMrV-lA" : {
      "name" : "Powderkeg",
      "transport_address" : "inet[/185.19.28.90:9300]",
      "hostname" : "elasticsearch-f76",
      "version" : "0.90.2",
      "http_address" : "inet[/185.19.28.90:9200]"
    },
    "bdqnkSNfTb63eGMM7CUjNA" : {
      "name" : "USAgent",
      "transport_address" : "inet[/185.19.28.92:9300]",
      "hostname" : "elasticsearch-a28",
      "version" : "0.90.2",
      "http_address" : "inet[/185.19.28.92:9200]"
    }
  }
}

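As a quick smoke test you can also index a document and fetch it back. The index and type names below are arbitrary, you would use an IP from your own cluster, and ES answers each call with a short JSON acknowledgement:

$ curl -XPUT 'http://185.19.28.90:9200/logs/entry/1' -d '{"message": "hello from whirr"}'
$ curl -XGET 'http://185.19.28.90:9200/logs/entry/1?pretty=true'
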
Really cool, have fun !!!

Wednesday, July 03, 2013

Apache Whirr and CloudStack for Big Data in the Clouds

This post is a little more formal than usual as I wrote this for a tutorial on how to run hadoop in the clouds, but I thought this was very useful so I am posting it here for everyone's benefit (hopefully).

When CloudStack graduated from the Apache Incubator in March 2013 it joined Hadoop as a Top-Level Project (TLP) within the Apache Software Foundation (ASF). This made the ASF the only open source foundation hosting both a cloud platform and a big data solution. Moreover, a closer look at the projects making up the entire ASF shows that approximately 30% of the Apache Incubator and 10% of the TLPs are "Big Data" related. Projects such as HBase, Hive, Pig and Mahout are sub-projects of the Hadoop TLP. Ambari, Kafka, Falcon and Mesos are part of the Incubator and all related to the Hadoop ecosystem.

To complement CloudStack, API wrappers such as Libcloud, Deltacloud and jclouds are also part of the ASF. To connect CloudStack and Hadoop, two interesting projects are also in the ASF: Apache Whirr, a TLP, and Provisionr, currently in incubation. Both Whirr and Provisionr aim at providing an abstraction layer to define big data infrastructure based on Hadoop and to instantiate those infrastructures on clouds, including Apache CloudStack based clouds. This co-existence of CloudStack and the entire Hadoop ecosystem under the same open source foundation means that the same governance, processes and development principles apply to both projects, bringing great synergy and promising even better complementarity.

In this tutorial we introduce Apache Whirr, an application that can be used to define, provision and configure big data solutions on CloudStack based clouds. Whirr automatically starts instances in the cloud and bootstraps hadoop on them. It can also add packages such as Hive, HBase and Yarn for map-reduce jobs.

Whirr [1] is a "set of libraries for running cloud services", and specifically big data services. Whirr is based on jclouds [2]. Jclouds is a Java-based abstraction layer that provides a common interface to a large set of cloud services and providers such as Amazon EC2, Rackspace servers and CloudStack. As such, all cloud providers supported in jclouds are supported in Whirr. The core contributors of Whirr include four developers from Cloudera, the well-known Hadoop distribution company. Whirr can also be used as a command line tool, making it straightforward for users to define and provision Hadoop clusters in the Cloud.

As an Apache project, Whirr comes as a source tarball and can be downloaded from one of the Apache mirrors [3]. Similarly to CloudStack, Whirr community members can host packages. Cloudera is hosting whirr packages to ease the installation. For instance on Ubuntu and Debian based systems you can add the Cloudera repository by creating /etc/apt/sources.list.d/cloudera.list and putting the following contents in it:

deb [arch=amd64] http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib 
deb-src http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib

With this repository in place, one can install whirr with:

$sudo apt-get install whirr

The whirr command will now be available. Developers can use the latest version of Whirr by cloning the software repository, writing new code and submitting patches the same way that they would submit patches to CloudStack. To clone the git repository of Whirr do:

$git clone git://git.apache.org/whirr.git

They can then build their own version of whirr using maven:

$mvn install

The whirr binary will be located under the /bin directory. Adding it to one's path with:

$export PATH=$PATH:/path/to/whirr/bin

will make the whirr command available in the user's environment. Successful installation can be checked by simply entering:

$whirr --help

With whirr installed, one now needs to specify the credentials of the cloud that will be used to create the Hadoop infrastructure. A ~/.whirr/credentials file has been created during the installation phase. The type of provider (e.g. cloudstack), the endpoint of the cloud and the access and secret keys need to be entered in this credentials file like so:

PROVIDER=cloudstack
IDENTITY=
CREDENTIAL=
ENDPOINT=

For instance on Exoscale [4], a CloudStack based cloud in Switzerland, the endpoint would be https://api.exoscale.ch/compute.
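
For example, a filled-in credentials file for Exoscale would look like the following; the identity and credential values are placeholders for the API and secret keys of your own account:

PROVIDER=cloudstack
IDENTITY=<your API key>
CREDENTIAL=<your secret key>
ENDPOINT=https://api.exoscale.ch/compute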

Now that the CloudStack cloud endpoint and keys have been configured, the hadoop cluster that we want to instantiate needs to be defined. This is done in a properties file using a set of Whirr specific configuration variables [5]. Below is the content of the file with explanations in-line:

---------------------------------------
# Set the name of your hadoop cluster
whirr.cluster-name=hadoop

# Change the name of cluster admin user
whirr.cluster-user=${sys:user.name}

# Change the number of machines in the cluster here
# Below we define one hadoop namenode and 3 hadoop datanode
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker

# Specify which distribution of hadoop you want to use
# Here we choose to use the Cloudera distribution
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop

# Use a specific instance type.
# Specify the uuid of the CloudStack service offering to use for the instances of your hadoop cluster
whirr.hardware-id=b6cd1ff5-3a2f-4e9d-a4d1-8988c1191fe8

# If you use ssh key pairs to access instances in the cloud
# Specify them like so
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_exoscale
whirr.public-key-file=${whirr.private-key-file}.pub

# Specify the template to use for the instances
# This is the uuid of the CloudStack template
whirr.image-id=1d16c78d-268f-47d0-be0c-b80d31e765d2
------------------------------------------------------

To launch this Hadoop cluster use the whirr command line:

$whirr launch-cluster --config hadoop.properties

The following example output shows the instances being started and bootstrapped. At the end of the provisioning, whirr returns the ssh commands that can be used to access the hadoop instances.

-------------------
Running on provider cloudstack using identity mnH5EbKcKeJd456456345634563456345654634563456345
Bootstrapping cluster
Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker
Configuring template for bootstrap-hadoop-namenode_hadoop-jobtracker
Starting 3 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(9d5c46f8-003d-4368-aabf-9402af7f8321)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(6727950e-ea43-488d-8d5a-6f3ef3018b0f)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-namenode_hadoop-jobtracker} on node(6a643851-2034-4e82-b735-2de3f125c437)
<< success executing InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725): {output=This function does nothing. It just needs to exist so Statements.call("retry_helpers") doesn't call something which doesn't exist
Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B]
Get:2 http://security.ubuntu.com precise-security Release [49.6 kB]
Hit http://ch.archive.ubuntu.com precise Release.gpg
Get:3 http://ch.archive.ubuntu.com precise-updates Release.gpg [198 B]
Get:4 http://ch.archive.ubuntu.com precise-backports Release.gpg [198 B]
Hit http://ch.archive.ubuntu.com precise Release
..../snip/.....
You can log into instances using the following ssh commands:
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.xx.yy.zz
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.zz.zz.rr
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.tt.yy.uu
[hadoop-namenode+hadoop-jobtracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.ii.oo.pp
-----------

To destroy the cluster from your client do:

$whirr destroy-cluster --config hadoop.properties

Whirr gives you the ssh commands to connect to the instances of your hadoop cluster; log in to the namenode and browse the hadoop file system that was created:

$ hadoop fs -ls /
Found 5 items
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /hadoop
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /hbase
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /mnt
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /tmp
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /user

Create a directory to put your input data.

$ hadoop fs -mkdir input
$ hadoop fs -ls /user/sebastiengoasguen
Found 1 items
drwxr-xr-x   - sebastiengoasguen supergroup          0 2013-06-21 20:15 /user/sebastiengoasguen/input

Create a test input file and put it in the hadoop file system:

$ cat foobar 
this is a test to count the words
$ hadoop fs -put ./foobar input
$ hadoop fs -ls /user/sebastiengoasguen/input
Found 1 items
-rw-r--r--   3 sebastiengoasguen supergroup         34 2013-06-21 20:17 /user/sebastiengoasguen/input/foobar

Define the map-reduce environment. Note that this default Cloudera distribution installation uses MRv1. To use Yarn one would have to edit the hadoop.properties file.

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce

Start the map-reduce job:

$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar wordcount input output
13/06/21 20:19:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/21 20:20:00 INFO input.FileInputFormat: Total input paths to process : 1
13/06/21 20:20:00 INFO mapred.JobClient: Running job: job_201306212011_0001
13/06/21 20:20:01 INFO mapred.JobClient:  map 0% reduce 0%
13/06/21 20:20:11 INFO mapred.JobClient:  map 100% reduce 0%
13/06/21 20:20:17 INFO mapred.JobClient:  map 100% reduce 33%
13/06/21 20:20:18 INFO mapred.JobClient:  map 100% reduce 100%
13/06/21 20:20:21 INFO mapred.JobClient: Job complete: job_201306212011_0001
13/06/21 20:20:22 INFO mapred.JobClient: Counters: 32
13/06/21 20:20:22 INFO mapred.JobClient:   File System Counters
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes read=133
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes written=766347
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of write operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes read=157
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes written=50
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of write operations=3
13/06/21 20:20:22 INFO mapred.JobClient:   Job Counters 
13/06/21 20:20:22 INFO mapred.JobClient:     Launched map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Launched reduce tasks=3
13/06/21 20:20:22 INFO mapred.JobClient:     Data-local map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=10956
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=15446
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:   Map-Reduce Framework
13/06/21 20:20:22 INFO mapred.JobClient:     Map input records=1
13/06/21 20:20:22 INFO mapred.JobClient:     Map output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Map output bytes=66
13/06/21 20:20:22 INFO mapred.JobClient:     Input split bytes=123
13/06/21 20:20:22 INFO mapred.JobClient:     Combine input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Combine output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input groups=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce shuffle bytes=109
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Spilled Records=16
13/06/21 20:20:22 INFO mapred.JobClient:     CPU time spent (ms)=1880
13/06/21 20:20:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=469413888
13/06/21 20:20:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5744541696
13/06/21 20:20:22 INFO mapred.JobClient:     Total committed heap usage (bytes)=207687680

And you can finally check the output:

$ hadoop fs -cat output/part-* | head
this 1
to  1
the  1
a  1
count 1
is  1
test 1
words 1

Of course this is a silly example of a map-reduce job, and you will want to run much more meaningful tasks. To benchmark your cluster, Hadoop ships with an examples jar.

To benchmark your hadoop cluster you can use the TeraSort tools available in the hadoop distribution. Generate some 100 MB of input data with TeraGen (100 byte rows):

$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teragen 1000000 output3

Sort it with TeraSort:

$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar terasort output3 output4

And then validate the results with TeraValidate:

$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teravalidate output4 outvalidate

Performance of map-reduce jobs run in cloud based hadoop clusters will be highly dependent on the hadoop configuration, the template and the service offering being used, and of course on the underlying hardware of the cloud. Hadoop was not designed to run in the Cloud, and therefore some assumptions were made that do not fit the Cloud model; see [6] for more information. Deploying Hadoop in the Cloud is however a viable solution for on-demand map-reduce applications. Development work is currently under way within the Google Summer of Code program to provide CloudStack with an Amazon Elastic Map-Reduce (EMR) compatible service. This service will be based on Whirr or on a new Amazon CloudFormation compatible interface called StackMate [7].

[1] http://whirr.apache.org
[2] http://jclouds.incubator.apache.org
[3] http://www.apache.org/dyn/closer.cgi/whirr/
[4] http://exoscale.ch
[5] http://whirr.apache.org/docs/0.8.2/configuration-guide.html
[6] http://wiki.apache.org/hadoop/Virtual%20Hadoop
[7] https://github.com/chiradeep/stackmate

Friday, June 28, 2013

Build a Cloud Paris recap

On June 19th we had a Build a Cloud Day in Paris; 60 of us gathered to learn about Apache CloudStack and hear from CloudStack users, integrators and ecosystem partners. We had four great sponsors that helped make the event possible: iKoula, a public cloud provider in Paris; UShareSoft, a software provider; Apalia, a CloudStack integrator and editor of Amysta, a CloudStack usage and metering plugin; and OW2, an open source consortium. I finally got all the slides and I embed them in this post for your enjoyment. The day started with an intro about BACD and a presentation of the various talks that we were going to have. I talked about the Apache Software Foundation (ASF) and showed how several Apache projects could be used to build a complete cloud infrastructure.

Ikoula followed with a presentation on how to use CloudStack to build a public cloud. Public clouds are the natural evolution of traditional hosting providers; ikoula offers several cloud services based on a default CloudStack install. With 1,000 servers and approximately 8,000 virtual machines in its cloud, it is a perfect example of a public cloud in production with CloudStack. In the afternoon, Joaquim Dos Santos gave an introductory demo on the CloudStack API and CloudMonkey.

Following the presentation on public cloud, Florent Paillot from INRIA talked about the national continuous integration platform, a build and test system for INRIA researchers. Private clouds used for build and test are a common use case for CloudStack. Florent shared the design and reasoning that went into choosing CloudStack. He also talked about operational details and some issues that he will bring up on the CloudStack mailing list.

With public and private cloud use cases behind us, it was time to talk about the software that is available to users to bring added value and ease of use to their cloud. The first presentation was from UShareSoft, whose software makes image creation and management a breeze. UShareSoft also offers an App Store-like marketplace and a migration service. In the afternoon we had a complete hands-on demo of UShareSoft, which boasts a sleek user interface and intuitive image management capabilities.

Activeeon followed just before lunch. Activeeon had already presented at the BACD in Ghent back in January; Brian Amedro, the CTO, talked about ProActive and its CloudStack plugin. ProActive is used to manage complex computational workflows in industries such as finance, pharmaceuticals and video editing. The CloudStack use case was particularly telling: working with a customer, they used the CloudStack plugin for on-demand video rendering.

In the fall I had the opportunity to work with a group of students from the Ecole des Mines de Nantes. They developed a CloudStack plugin for BtrCloud, a piece of software that optimizes the placement of virtual machines in the data center. BtrCloud presented the latest developments of this plugin, showcasing an embedded user interface and the ability to create various rules (e.g. affinity, anti-affinity, spread, pack...) for migration. BtrCloud promises to be very interesting as it could be used for Green IT in the Cloud.

We ended the day with a presentation on cloud usage metering with Amysta. Cloud usage is an internal functionality of CloudStack, which keeps track of all usage in a MySQL database. Amysta extends the core CloudStack metering capability with an embedded interface to create very fine grained usage reports on all resources. Pricing and currency can be set and reports generated in various formats for organizations or individual users.

And that was it in Paris; we will certainly plan another event, as everyone was very interested in learning even more. iKoula offered to host a regular CloudStack meet-up, so stay tuned for information on that front !