Monday, March 25, 2013

Quantifying the Apache CloudStack Community

CloudStack is now an Apache Top Level Project (TLP) at the Apache Software Foundation (ASF); the announcement just came out. What the incubation period has meant for CloudStack has nothing to do with code maturity. CloudStack was mature and used in enterprise settings before it entered incubation at the ASF. What incubation meant was that CloudStack evolved into an open source community, self-governed by the Apache Way: transparency, meritocracy, respect, non-affiliation and consensus, in no particular order. The community has learned and demonstrated that it understands the principles and processes laid out by the Apache Software Foundation and that it can now operate more autonomously.

Growing an open source community is challenging. Folks who participate come from various backgrounds, may seldom meet, and interact mostly via email, social media and instant messaging. Participants come from all over the world, work in different time zones and donate their time after (and sometimes while) dealing with day jobs and family. Participants rally around a project that they deem interesting, sometimes just to lend a hand for a few months, sometimes because their day job requires them to do it. From that very heterogeneous and fluctuating mix, an open source software community emerges, self-governed, sustainable and non-affiliated. In the last 12 months CloudStack has done just that: building a community from the ground up, developing and understanding the principles laid out in our bylaws, adapting (if need be) people's way of developing software, getting to know each other, welcoming new members every day and setting the foundation for sustainable software.

When growing a community it is fairly natural to want to measure how well we are doing and how healthy the community is. Over the last several months I have been collecting data to analyze our community and see how we were doing and interacting. I mostly looked at our public mailing lists, doing a study similar to the one that compared CloudStack, OpenNebula, OpenStack and Eucalyptus. Secretly, this was also a good way for me to sharpen a few skills on BigData; not that big actually, but I used MongoDB instead of MySQL so that qualifies as BigData :). Defining membership in an open source community is a challenge since there is no formal concept of membership, and even the concept of contribution is ill-defined. What constitutes a contribution? Which channels need to be considered? In the case of the ASF for instance, contribution to code may be taken to mean being a committer, but a committer is simply someone with write access to the code. Just counting committers will leave out all the folks sending patches, doing testing, doing user support, translating documentation, giving talks and so on. Also, while at the ASF everything happens on the mailing lists, what about IRC channels, social media like Facebook, Twitter and LinkedIn, and other communities that may arise around a particular piece of software: sub-projects, user groups etc.? In this analysis I decided to only look at our public mailing lists, but there is more to it than just this data source.

The two figures below show the number of individual contributors, measured by unique email addresses used to send messages to the users and developers mailing lists. The red lines represent data from the users and developers mailing lists prior to entering incubation at the ASF. The blue lines represent the ASF-specific lists. The impact that the move to the ASF has had on the number of contributors is significant. The developers list has peaked at over 200 contributors per month and the users list at over 150 per month so far (figure on the left). The last data point is March (as of March 21st) and numbers will go up by the end of the month. The graph on the right shows the accumulation of contributors, adding all unique email addresses every month into a set. This shows us again that the move to the ASF has had a huge impact on the growth rate of the community and that both lists grow at roughly the same pace. Adding the accumulated contributors of both lists and removing the duplicates present in both sets gives us a magic number of 722 CloudStack contributors to date.
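For readers who want to try the same kind of counting, here is a minimal Python sketch of the idea. It is not the script behind the figures (that analysis used MongoDB): the function names are hypothetical and the (month, address) pairs are made-up sample data. It counts unique senders per month, accumulates them into a growing set, and takes the union across lists so that people active on both are counted once.

```python
from collections import defaultdict


def monthly_and_cumulative(messages):
    """Count unique senders per month and unique senders to date.

    messages: iterable of (month, sender) pairs, month as 'YYYY-MM'.
    Returns a list of (month, unique_that_month, unique_to_date) tuples.
    """
    by_month = defaultdict(set)
    for month, sender in messages:
        # Normalize so 'Bob@Example.com' and 'bob@example.com' match.
        by_month[month].add(sender.strip().lower())

    seen = set()
    rows = []
    for month in sorted(by_month):
        seen |= by_month[month]  # accumulate every address seen so far
        rows.append((month, len(by_month[month]), len(seen)))
    return rows


def total_contributors(*lists):
    """Union of senders across several lists; duplicates count once."""
    everyone = set()
    for messages in lists:
        everyone |= {sender.strip().lower() for _, sender in messages}
    return len(everyone)


# Made-up sample data, for illustration only.
users = [("2013-01", "alice@example.org"), ("2013-01", "bob@example.com"),
         ("2013-02", "alice@example.org"), ("2013-02", "carol@example.net")]
dev = [("2013-01", "bob@example.com"), ("2013-02", "dave@example.org")]

print(monthly_and_cumulative(users))
print(total_contributors(users, dev))
```

Note that one person posting from two addresses is counted twice with this approach, so unique addresses are only an approximation of unique people.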

As mentioned earlier, these numbers of contributors are different from the number of Apache CloudStack committers. According to our bylaws, contributors can become committers and gain write access to the code when invited by the Project Management Committee (PMC). Just today we welcomed three new committers for a current total of 54. This number is again relative in terms of contribution to code, since committers apply patches from contributors who are not yet committers (I know, confusing :) ). A quick look at our git repo shows a current total of 159 code contributors. To that we could also add the 32 contributors who helped with translation via Transifex, who may or may not have participated in the mailing lists.

These numbers show a growing community sparked by the move to the ASF. While at the start of incubation the initial set of committers and contributors were only from Citrix, which donated the code, we are now also seeing a diversification in the companies involved. Talking about affiliation is actually a cardinal sin of the Apache Way. At the ASF only individuals matter, but it is fair to say that for CloudStack to be successful we need to see adoption and participation by a diverse set of companies. The two figures below show just that, with the number of companies identified by the email domain used by the contributors. This is not perfect since contributors often do not use their work email, but it gives a good idea of the trend. As in the plots about contributors, we plot the number of companies (really email domains) per month and accumulated over time. Clearly the number of companies involved has grown since joining the ASF; we currently see around 50 companies involved in CloudStack every month. Accumulating all these companies we reach almost 200 on each list. Removing duplicates from both sets, we estimate the total number of companies at 272. Even if some of those companies are only represented by a single individual, this is still a very strong number that shows great diversification. Interestingly, over the last three months participation on the users list has shown more diversity than on the developers list.
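Using the email domain as a stand-in for the company is easy to sketch in Python. This is an illustration under stated assumptions rather than the code used for the plots: `domain_of` and `domains_per_month` are hypothetical names, the sample addresses are made up, and malformed addresses are simply skipped.

```python
def domain_of(address):
    """Return the lowercased domain of an email address, or None if malformed."""
    _, sep, domain = address.strip().rpartition("@")
    return domain.lower() if sep and domain else None


def domains_per_month(messages):
    """Count distinct sender domains per month.

    messages: iterable of (month, sender) pairs, month as 'YYYY-MM'.
    Returns a dict mapping month to the number of distinct domains.
    """
    by_month = {}
    for month, sender in messages:
        domain = domain_of(sender)
        if domain is not None:  # skip addresses without a usable domain
            by_month.setdefault(month, set()).add(domain)
    return {month: len(domains) for month, domains in sorted(by_month.items())}


# Made-up sample data, for illustration only.
sample = [("2013-01", "alice@example.org"), ("2013-01", "bob@example.org"),
          ("2013-01", "carol@example.com"), ("2013-02", "dave@example.net"),
          ("2013-02", "not-an-address")]

print(domains_per_month(sample))
```

With this approach a personal webmail domain counts as one "company", which is the imperfection mentioned above; filtering such domains out would be a design choice on top of this sketch.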

Finally, looking at the cloudstack-commits mailing list, we try to identify the number of committers to the code. The figure below shows the number of commits and the number of committers, measured respectively by the number of emails and the number of unique email addresses on the commits mailing list. We already mentioned that we have 54 committers, but some of those committers gained their rights based on contributions that can be as diverse as development, user support, translation etc. Therefore it is no surprise to see the monthly number of committers peaking at 35 and the accumulated number reaching 45. The number of commits is growing, reaching 1500 a month during the 4.0 release, easing up around Christmas time and peaking again at over 2500 per month prior to the code freeze of the 4.1 release.

All in all, analyzing lots of emails showed me that moving to the ASF has had a tremendous impact on CloudStack. With over 700 contributors, close to 60 committers, 30 translators and over 1 million lines of code, CloudStack is self-governed and here to stay.
