Tuesday, January 08, 2013

A Mahout Cluster across France and Luxembourg Using CloudStack

In early December I attended the Grid5000 Winter School, held at the Ecole des Mines de Nantes (EMN) and organized by Adrien Lebre. Grid5000 is "a scientific instrument designed to support experiment-driven research in all areas of computer science related to parallel, large-scale or distributed computing and networking". Basically, it is a large-scale testbed to design, build and test distributed systems. The US also has such an infrastructure in academia, called FutureGrid. These research infrastructures have become key to enabling research in distributed systems at scales approaching those now seen in industry, rather than testing systems on a couple of machines in a single lab. Currently Grid5000 operates 1195 physical hosts, for a total of 8184 cores across 10 sites.

While in Nantes I met Alexandra Carpen Amarie, an INRIA research engineer who developed an amazing tool, G5k campaign. G5k campaign allows any user of Grid5000 (G5k) to book nodes on the infrastructure, deploy machines with bare-metal provisioning and then deploy their favorite Cloud IaaS framework (currently CloudStack, OpenNebula and Nimbus). The G5k scripts are available via git; of particular interest are some Chef recipes. Although heavily tailored to Alexandra's scripts, they could be useful for the CloudStack community. Alexandra held a tutorial on deploying an IaaS and a PaaS on G5k, attended by approximately 30 people. For the tutorial the PaaS was Apache Mahout. Let's not get into a discussion about whether Mahout is a PaaS or not; the point is that an IaaS can be used to deploy and manage a set of nodes that run Hadoop with Mahout on top, to provide high-level functionality, in this case machine learning algorithms to analyze large data sets. How did it work exactly?

One thing about G5k, and I believe about the French research computing community in general, is that they are very prolific in creating great tools. Unfortunately, few people know about them. The clusters of G5k are operated like regular batch processing clusters: a batch scheduler is used to access the nodes. Tool #1 is OAR, a PBS/MOAB-like equivalent. Once the nodes are allocated they are provisioned using Tool #2, Kadeploy, a Crowbar-like equivalent. Of great interest is Tool #3, KaVLAN, a tool to lease the VLANs configured on G5k. While it is not currently used in Alexandra's G5k campaign, I hope the Apache CloudStack community can start making use of it to test Advanced Zones.

The beauty of G5k campaign is that Alexandra has hidden most of the complexity of provisioning and configuration. You only need to write a YAML configuration file for your deployment, specifying the sites and the number of nodes that you want to run on at each one, for example:

deployment:
  engine:
    name: CloudStack
    customization_type: multisiteChef
  walltime: 2:00:00
  sites:
    rennes:
      nodes: 10
      subnet: slash_22=1
    nancy:
      nodes: 10
      subnet: slash_22=1
    sophia:
      nodes: 10
      subnet: slash_22=1
ssh:
  user: username
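
Since every site entry in the file above follows the same pattern, a small script can generate the configuration for any list of sites. The following is only a sketch: the function name `render_campaign_config` and its parameters are hypothetical, not part of G5k campaign, and the lower-case keys for sites other than the three shown above (for example `toulouse` or `luxembourg`) are assumed by analogy.

```python
def render_campaign_config(sites, nodes_per_site=10, walltime="2:00:00",
                           user="username"):
    """Return YAML text that mirrors the layout of the example above.

    Hypothetical helper: only the keys that appear in the example are emitted.
    """
    lines = [
        "deployment:",
        "  engine:",
        "    name: CloudStack",
        "    customization_type: multisiteChef",
        f"  walltime: {walltime}",
        "  sites:",
    ]
    for site in sites:
        # One block per site, same shape as the rennes/nancy/sophia entries.
        lines += [
            f"    {site}:",
            f"      nodes: {nodes_per_site}",
            "      subnet: slash_22=1",
        ]
    lines += ["ssh:", f"  user: {user}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Site keys beyond rennes/nancy/sophia are assumptions.
    print(render_campaign_config(["rennes", "nancy", "sophia"]), end="")
```

Run as a script, this prints a file with the same structure as the hand-written one; passing a longer site list only adds more site blocks.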

Launch your campaign and wait for the nodes to be allocated, provisioned and then configured with your IaaS. Depending on the number of nodes requested, you could have a working Cloud within 20 minutes. You can then interact with it. In the case of CloudStack, Alexandra developed some wrappers around the API to manage VMs. She wrote them before CloudMonkey came out; they are no longer needed, but they were still a great exercise. A couple of days after the tutorial I asked Alexandra to deploy CloudStack across several sites. Within 24 hours I had the snapshots shown below in my inbox: a five-site cloud with one basic zone per physical site, 97 physical hosts, 800 cores and 100 VMs running Mahout. It took 30 minutes to deploy the nodes and one hour to configure the hosts in CloudStack (serially; Alexandra is working on adding parallel configuration to her tool). The 100 VMs were deployed in roughly 10 minutes. The 3 physical nodes missing were due to bare-metal provisioning problems.
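
For readers curious about what a wrapper around the CloudStack API involves, the core of it is signing each request. The sketch below is not Alexandra's code; it follows my understanding of the signing scheme described in the CloudStack developer documentation (URL-encode the values, sort the parameters by name, lower-case the string, HMAC-SHA1 it with the secret key, then Base64-encode). The endpoint, keys and function names are all placeholders.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def _sorted_query(params):
    """URL-encode each value and join the pairs sorted by field name."""
    pairs = sorted(params.items(), key=lambda kv: kv[0].lower())
    return "&".join(f"{key}={quote(str(value), safe='')}" for key, value in pairs)


def sign(params, secret_key):
    """HMAC-SHA1 over the lower-cased, sorted query string, then Base64."""
    message = _sorted_query(params).lower().encode("utf-8")
    digest = hmac.new(secret_key.encode("utf-8"), message, hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def build_url(endpoint, command, api_key, secret_key, **extra):
    """Return a signed GET URL for one API command (sketch only)."""
    params = {"command": command, "apiKey": api_key, "response": "json", **extra}
    signature = quote(sign(params, secret_key), safe="")
    return f"{endpoint}?{_sorted_query(params)}&signature={signature}"


if __name__ == "__main__":
    # Placeholder endpoint and credentials; substitute real values to use it.
    print(build_url("http://localhost:8080/client/api", "listZones",
                    "API_KEY", "SECRET_KEY"))
```

The resulting URL can be fetched with any HTTP client. In practice CloudMonkey handles this signing for you, which is why hand-written wrappers are no longer necessary.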

The snapshot below shows the infrastructure/zone view of the CloudStack deployment. Five basic zones were configured at Rennes, Toulouse, Nancy, Sophia and Luxembourg, with all cities connected via the RENATER fiber network.

The infrastructure view below shows five zones, 97 hosts, 10 system VMs (console proxy and secondary storage) and 4 virtual routers (one router had not started at the time of the snapshot).

A small detail that you may have noticed in the YAML configuration file is that this is all based on ssh. Access to G5k is via ssh keys and not via a PKI. Having worked on TeraGrid, I found this a nice surprise. Using PKI across different organizations and managing authorization can be extremely complex, and it was a sore point in the TeraGrid. PKI is also used in the LHC grid with more success, but it still requires a lot of work. In G5k the user base is smaller and more trusted. SSH keys are distributed among sites using a basic NFS setup on the private RENATER network, which makes it easy for users to access all sites.

Looking ahead, the basic question one might ask is whether it makes sense to run Mahout within virtual machines. In cases where the dataset is not very large, the use of HDFS as a large-scale distributed storage system is not the issue; rather, the time spent running the machine learning algorithms is. There, the CPU overhead of virtualization is the main performance factor. Alexandra pointed me to a paper she wrote on the performance of MapReduce in the Cloud. I asked her to do some more analysis specific to a CloudStack-based Cloud. Stay tuned for the results :).
