Monday, March 25, 2013

Quantifying the Apache CloudStack Community

CloudStack is now an Apache Top Level Project (TLP) at the Apache Software Foundation (ASF); the announcement just came out. What the incubation period has meant for CloudStack has nothing to do with code maturity. CloudStack was mature and used in enterprise settings before it entered incubation at the ASF. What incubation meant was that CloudStack evolved into an open source community, self-governed by the Apache Way: transparency, meritocracy, respect, non-affiliation and consensus, in no particular order. The community has learned and demonstrated that it understands the principles and processes laid out by the Apache Software Foundation and that it can now operate more autonomously.

Growing an open source community is challenging. Folks who participate come from various backgrounds, may seldom meet, and interact mostly via email, social media and instant messaging. Participants come from all over the world, work in different time zones and donate their time after (and sometimes while) dealing with day jobs and family. Participants rally around a project that they deem interesting, sometimes just to lend a hand for a few months, sometimes because their day job requires them to do it. In that very heterogeneous and fluctuating mix, an open source software community emerges: self-governed, sustainable and non-affiliated. In the last 12 months CloudStack has done just that, building a community from the ground up, developing and understanding the principles laid out in our bylaws, adapting (if need be) people's way of developing software, getting to know each other, welcoming new members every day and setting the foundation for sustainable software.

When growing a community it is fairly natural to want to measure how well we are doing and how healthy the community is. Over the last several months I have been collecting data to analyze our community, trying to see how we were doing and interacting. I mostly looked at our public mailing lists, in a study similar to the one that compared CloudStack, OpenNebula, OpenStack and Eucalyptus. Secretly, this was also a good way for me to sharpen a few skills on BigData. Not that big actually, but I used MongoDB instead of MySQL so that qualifies as BigData :). Defining membership in an open source community is a challenge since there is no concept of membership; even the concept of contribution is ill-defined. What constitutes a contribution? Which channels need to be considered? In the case of the ASF for instance, contribution to code may only mean being a committer, but a committer is simply someone with write access to the code. Just counting committers will leave out all the folks sending patches, doing testing, doing user support, translating documentation, giving talks and so on. Also, while at the ASF everything happens on the mailing lists, what about IRC channels, social media like Facebook, Twitter and LinkedIn, and what about other communities that may arise around a particular software: sub-projects, user groups, etc.? In this analysis I decided to only look at our public mailing lists, but there is more to it than just this data source.

The two figures below show the number of individual contributors, measured by the unique email addresses used to send messages to the users and developers mailing lists. The red lines represent data from the users and developers mailing lists prior to entering incubation at the ASF. The blue lines represent the ASF-specific lists. The impact that the move to the ASF has had on the number of contributors is significant. The developers list has peaked at over 200 contributors per month and the users list at over 150 per month so far (figure on the left). The last data point is March (as of March 21st) and numbers will go up by the end of the month. The graph on the right shows the accumulation of contributors, adding all unique email addresses every month into a set. This shows us again that the move to the ASF has had a huge impact on the growth rate of the community and that both lists grow at roughly the same pace. Adding the accumulated number of contributors on both lists and removing duplicates present in both sets gives us a magic number of 722 CloudStack contributors to date.
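The counting described above can be sketched in a few lines of Python. This is only an illustration, assuming the messages have already been parsed into (month, sender) pairs; the function name and the sample addresses are made up, and it is not the script that produced the figures.

```python
from collections import defaultdict


def monthly_and_cumulative(messages):
    """Count unique senders per month and the running total of senders seen.

    `messages` is an iterable of (month, sender) pairs, with month as a
    'YYYY-MM' string so that plain sorting gives chronological order.
    """
    by_month = defaultdict(set)
    for month, sender in messages:
        # Normalize case so Bob@... and bob@... count as one contributor.
        by_month[month].add(sender.strip().lower())

    seen = set()
    rows = []
    for month in sorted(by_month):
        seen |= by_month[month]  # accumulate every address seen so far
        rows.append((month, len(by_month[month]), len(seen)))
    return rows, seen


# Hypothetical sample data standing in for the two mailing lists.
users_list = [
    ("2013-01", "alice@example.org"),
    ("2013-01", "Bob@example.com"),
    ("2013-02", "alice@example.org"),
    ("2013-02", "carol@example.net"),
]
dev_list = [
    ("2013-01", "bob@example.com"),
    ("2013-02", "dave@example.org"),
]

users_rows, users_seen = monthly_and_cumulative(users_list)
dev_rows, dev_seen = monthly_and_cumulative(dev_list)

# The set union removes people who post on both lists.
total_contributors = len(users_seen | dev_seen)
print(users_rows, dev_rows, total_contributors)
```

Using sets keeps the de-duplication implicit: the monthly figure is the size of one month's set, the cumulative figure is the size of the running union, and the overall total is the union of the two lists.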

As mentioned earlier, these numbers of contributors are different from the number of Apache CloudStack committers. According to our bylaws, contributors can become committers and gain write access to the code when invited by the Project Management Committee (PMC). Just today we welcomed three new committers for a current total of 54. This number is again relative in terms of contribution to code, since committers apply patches from contributors who are not yet committers (I know, confusing :)). A quick look at our git repo shows a current total of 159 code contributors. To that we could also add the 32 contributors who helped with the translation via Transifex, who may or may not have participated in the mailing lists.

These numbers show a growing community sparked by the move to the ASF. While at the start of incubation the initial set of committers and contributors were only from Citrix, who donated the code, we are also seeing a diversification in the number of companies involved. Talking about affiliation is actually a cardinal sin of the Apache Way. At the ASF only individuals matter, but it is fair to say that for CloudStack to be successful we need to see adoption and participation by a diverse set of companies. The two figures below show just that, with companies identified by the email domain used by the contributors. This is not perfect since contributors often do not use their work email, but it gives a good idea of the trend. As in the plots about contributors, we plot the number of companies (really email domains) per month and accumulated. Clearly the number of companies involved has grown since joining the ASF; we currently see around 50 companies involved in CloudStack every month. Accumulating all these companies we reach almost 200 on each list. Removing duplicates from both sets, we estimate the total number of companies at 272. Even if some of those companies are only represented by a single individual, this is still a very strong number that shows great diversification. Interestingly, over the last three months participation in the users list has shown more diversity than on the developers list.
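As a rough illustration of the "company = email domain" approximation, here is a minimal Python sketch. It assumes the same kind of (month, sender) pairs as input; the helper names and the sample addresses are hypothetical.

```python
from collections import defaultdict


def domain_of(address):
    """Return the lowercased part after the last '@' of an email address."""
    return address.rsplit("@", 1)[-1].strip().lower()


def domains_per_month(messages):
    """Map each month to the number of distinct sender domains seen in it."""
    by_month = defaultdict(set)
    for month, sender in messages:
        by_month[month].add(domain_of(sender))
    return {month: len(domains) for month, domains in sorted(by_month.items())}


# Hypothetical sample: two people from one domain count as one "company".
sample = [
    ("2013-01", "alice@example.com"),
    ("2013-01", "bob@Example.com"),
    ("2013-01", "carol@example.org"),
    ("2013-02", "dave@example.net"),
]
counts = domains_per_month(sample)
print(counts)
```

The caveat from the text applies directly: everyone posting from a personal webmail address collapses into a single domain, so these counts are a trend indicator rather than an exact census of companies.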

Finally, looking at the cloudstack-commits mailing list, we try to identify the number of committers to the code. The figure below shows the number of commits and the number of committers, measured respectively by the number of emails and the number of unique email addresses seen on the commits mailing list. We already mentioned that we had 54 committers, but some of those committers gained their rights based on contributions that could be as diverse as development, user support, translation, etc. Therefore it is no surprise to see the number of committers peaking at 35 and the accumulated number reaching 45. The number of commits is growing, reaching 1500 a month during the 4.0 release, easing up during Christmas time and peaking again at over 2500 per month prior to the code freeze of the 4.1 release.

All in all, analyzing lots of emails showed me that moving to the ASF has had a tremendous impact on CloudStack. With over 700 contributors, close to 60 committers, 30 translators and over 1 million lines of code, CloudStack is self-governed and here to stay.

Wednesday, March 13, 2013

Security in the Cloud and the CCSK

Search for cloud computing and you will get approximately 190 million results; search for cloud computing security and you will get 120 million results. This is very rough data of course, but it gives us an idea that when talking about Cloud, security is a big concern. Go to a conference and talk about Cloud, and you can be certain that one of the big questions you will get asked is "But what about Security?"

Disclaimer and bias: This question always leaves me pondering, mostly because my personal background and bias always make me wonder what people are afraid of in the Cloud and what they see the Cloud bringing to bear that is different from any existing distributed system running over the internet. I am not an enterprise security expert (I used to teach an introductory course on network security), but I have spent my fair share of time thinking about Clouds, especially at the IaaS layer. There, the new technology that could represent a new attack vector is virtualization, and I have only read about two non-traditional efforts that really challenged the security of virtualization: the controversial Blue Pill project in 2006 and the cross-VM side channel attack reported by a research group at MIT in 2009 (there are of course more...). Most problems publicly described with IaaS have been with spam and DDoS: on one hand cloud providers are being used to send spam, and on the other hand cloud providers are victims of DDoS attacks threatening the availability of services.

However, in the fall I had the chance to participate in the Dell in the Clouds Think Tank in London. It is there that I started to understand that what most people were worried about with the Cloud had more to do with legal issues, governance, compliance and contracts than hardcore attacks. Indeed, when dealing with a cloud provider you are exposing your data to new risks for the simple fact that it is not under your total control, and you need to manage those risks. Moving your data out of your secured premises and putting it in the hands of another party exposes you to new threats. This is the core of information assurance and risk management. Cloud security is therefore more about updating your security guidelines, making sure that you are compliant with the law and being confident that you can respond appropriately to any attack or business continuity issue. Cloud security is less about the fear of a new technology that exposes new attack vectors. The risks may be new to your enterprise, but the attacks and vulnerabilities used are not new to the internet.

To learn more and come up with a plan, I now point people to the Cloud Security Alliance (CSA) and their guidelines. It is a 176-page document which, coupled with the ENISA cloud security assessment (125 pages :)), forms the basis of the CSA Certificate of Cloud Security Knowledge (CCSK). I have finished reading the CSA guidelines, and once I have read the ENISA report I will take the CCSK exam.

The CSA guidelines are a set of reports covering fourteen domains of interest to Cloud security, from Governance and Legal Issues to Incident Response and Virtualization (to name a few). One sentence truly resonated with me due to my personal bias explained earlier. It is in the Application Security domain chapter, which states: "Cloud-based software applications require a design rigor similar to an application connecting to the raw internet - the security must be provided by the application without any assumptions being made about the external environment". Indeed, doing the opposite would be to fall for one of the fallacies of distributed systems design enunciated by Peter Deutsch from Sun. There lies, in my view, the biggest risk: thinking that you can take an application that has been designed in-house assuming a secure local network and move it to the cloud as-is, without managing the risks due to the fact that a) the network is not secure, b) bandwidth is not infinite, c) latency is not zero and d) transport has a cost.

Any service, application, provider or data that is accessible over the public internet is being attacked and is at risk of being stolen, tampered with, disrupted and even shut down. This is not fear mongering, it is just a fact, and if you design an application or use a service thinking otherwise you will be exposing yourself and not managing risks properly. Just as the new cloud software being developed (e.g. Hadoop, Cassandra, CloudStack) is designed assuming failures of components, when moving to the Cloud one needs to assume attacks and unsecured networks. This is not saying that the Cloud is insecure; this is saying that you need to adopt the proper security posture, a different one than if you have been operating under the (at least perceived) warmth and coziness of a secured local network.

Getting back to the CSA guidelines, the first domain, Cloud Architecture, is the perfect introduction to Cloud, with reference to the NIST definition of cloud computing. It is followed by what I think is the most important section: Governing in the Cloud. It presents risk management as key to enterprise governance and introduces legal issues and compliance management as they pertain to Cloud. The chapters help to define the proper security posture: defining or updating security policies that will make sense for Cloud use, understanding the assets that will be at risk, and understanding if and how compliance will be enforced. As such it is not specific to Cloud computing; it is really about best practices of risk management and understanding the contracts being signed with the cloud providers. Will those contracts expose you? Do providers follow data protection standards? Are the providers subject to any laws that may expose you (e.g. the Patriot Act)? How can you remedy those risks? Which providers can give you the compliance you need? To help with these decisions, CSA created the Security Trust and Assurance Registry (STAR). Cloud providers who participate in the registry submit answers to a questionnaire that lists the standards they follow in 99 categories, from audit and compliance to operations, business continuity, human resources and forensics. This registry is key to choosing cloud providers that will match your security and governance needs.

The last section of the CSA guidelines is about Operating in the Cloud, from disaster recovery, data center operations and incident response to encryption, authentication and virtualization. This section is not specific to cloud but comes into play when selecting providers, to ensure for example that the provider's data center operations match your requirements, or that in case of incidents you will have access to the logs (defining which ones in a contract). I was pleased to see John Kinsella (@johnlkinsella) from Stratosec as one of the authors of the chapter on Application Security. John is a member of the CloudStack Project Management Committee (PMC). I was also happily surprised to see a chapter on Security as a Service, something that Mice Xia from tcloudcomputing, a committer on CloudStack, has been working on.

To summarize, knowledge is power (a bit cheesy, I know). When moving to the Cloud, an enterprise should engage its security experts from the outset, making sure that risk is managed and that everything is in compliance; this is standard information assurance and part of good enterprise governance. When negotiating (or not negotiating) contracts with Cloud Providers, the STAR registry can help choose the providers that will best match the requirements of the enterprise. I am not being paid by CSA or SANS, but I would recommend that people get the CCSK certification and probably also take a SANS course on Cloud Security. In my opinion the cloud is not less secure than anything else. The Cloud (at least in its public form) is about accessing resources over the network and locating assets off-premises. This intrinsically presents risks, but it is manageable risk that needs to be part of a design. You need to design for attacks, test your designs, monitor, counter-strike. Basic warfare. And remember, the network is secure... Oops... sorry!

Tuesday, March 05, 2013

Activeeon ProActive integrates with Apache CloudStack

Since CloudStack entered incubation at the Apache Software Foundation, there has been lots of work on integrating existing software solutions with Apache CloudStack (ACS). On the networking side we have seen integration with Nicira, BigSwitch BVS, VMware dvSwitch and Midokura Midonet. On the storage side we have seen integration with Ceph, Riak CS, Caringo and more recently SolidFire. All of these integrations are either already present in the 4.0 release, set for 4.1 at the end of March or in the works for the 4.2 release this summer. Most of these integration efforts need some tight integration with the CloudStack code: developing plugins, writing new classes, potentially defining new orchestration steps, and adding UI interaction. In this post, I want to introduce an integration with ProActive from a French company called Activeeon (@activeeon). They treated ACS as a black box and integrated with it using the default exposed API, a very powerful mechanism for integrating existing solutions and enterprise workflows with a private or public cloud.

Activeeon is a company that originated from an INRIA research lab (INRIA is the leading computer research organization in France, and a CloudStack user in their continuous integration department). One of their solutions, ProActive, is open source software available at the OW2 consortium. ProActive is an advanced workflow manager that combines a powerful IDE, a workflow engine and a resource manager. It aims to take complex computational workflows and ease their execution on distributed resources such as HPC clusters, desktop grids and clouds. Existing applications using ProActive come from a diverse set of industries such as the financial, biological and automotive industries. Integrating with Clouds allows ProActive to dynamically provision resources to execute a workflow based on a set of pre-defined policies and constraints that are up to the user. Activeeon had developed an Amazon EC2 resource plugin, and it made complete sense to integrate with Apache CloudStack, either through the EC2 mapping or directly via the ACS native API. Below I embed slides from Brian Amedro (@brianamedro) describing the integration, as well as a video demoing it live.

Practically speaking, how did the integration happen? According to @brianamedro it took one person five days. They used the ACS sandbox DevCloud to stand up a live ACS cloud. They modified the template used in DevCloud to make it a ProActive nodesource template. To call the API they used an existing library which, even though it was created for an older CloudStack release, still worked with ACS 4.0. The resulting Java code was open sourced on the OW2 repository. Looking ahead, Activeeon wants to test the scalability of the CloudStack connector on a cloud of a thousand nodes and add stronger integration, especially in terms of identity management and auto-scaling.
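To give a feel for what calling the native API as a black box involves, here is a minimal Python sketch of building a signed CloudStack API request. This is my own illustration and not the library Activeeon used. It follows the signing scheme as I understand it from the CloudStack API documentation: sort the parameters, lowercase the string and sign it with HMAC-SHA1. The endpoint, the keys and the function name are placeholders, and the handling of unusual characters should be checked against the documentation before relying on it.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def signed_url(endpoint, params, api_key, secret_key):
    """Build a signed GET URL for a CloudStack API command (sketch only)."""
    query = dict(params, apikey=api_key)
    # URL-encode every value; spaces become %20, not '+'.
    pairs = [(key, quote(str(value), safe="")) for key, value in query.items()]
    # String to sign: pairs sorted by field name, whole string lowercased.
    ordered = sorted(pairs, key=lambda kv: kv[0].lower())
    to_sign = "&".join(f"{k}={v}" for k, v in ordered).lower()
    digest = hmac.new(secret_key.encode(), to_sign.encode(), hashlib.sha1).digest()
    # The signature itself is base64-encoded, then URL-encoded.
    signature = quote(base64.b64encode(digest).decode(), safe="")
    request = "&".join(f"{k}={v}" for k, v in pairs)
    return f"{endpoint}?{request}&signature={signature}"


# Placeholder endpoint and credentials, for illustration only.
url = signed_url(
    "http://localhost:8080/client/api",
    {"command": "listVirtualMachines", "response": "json"},
    api_key="MY_API_KEY",
    secret_key="MY_SECRET_KEY",
)
print(url)
```

Fetching the resulting URL with any HTTP client should return the JSON response of the command, which is all a connector like this needs in order to list, start or destroy instances.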

What their work showed me is that out-of-the-box integration of existing middleware systems with CloudStack is quite straightforward. The rich ACS API opens the door for very powerful couplings and a very extensive CloudStack ecosystem. To top it off, I really appreciated that they developed straight up in open source mode and that their work is available on the OW2 consortium forge.