Monday, May 06, 2013

Update on Apache CloudStack Community Analysis

Since my last post on the analysis of the CloudStack community, we have graduated and become a top-level project. It is about time to give an update on what can be seen as a measure of the health of our community.

All the data presented here is based on an analysis of the mailing lists, and the data is publicly accessible. I have used it previously: when we graduated on March 22nd, in January, and back in November when I did some social network analysis. This study was inspired by John Jiang, now working at Eucalyptus. You can read his analysis; note that he moved it to the Eucalyptus website.

Methodology: As explained in previous posts, a Contributor is someone who sent an email to one of the CloudStack mailing lists. This is not to be confused with a Committer, which at the ASF means someone with write access to the code. Not all code contributors have write access. I identify Companies by the email domain used by the Contributors, because Contributors are non-affiliated in the ASF. Obviously this has some limitations, as some email domains can represent several different companies. All emails are loaded into a MongoDB database, and queries are run against it to produce the plots that you will see below. We currently have seven mailing lists of varying traffic: announce, users, users-cn, dev, marketing, commits, issues. Note that all JIRA emails are now sent to the issues list. The number of subscribers to these lists and the number of messages last month are as follows:

* dev@ 609 subs / ~2600 msgs in Apr
* users@ 782 subs / ~800 msgs in Apr
* issues@ 109 subs / ~2400 msgs in Apr
* commits@ 166 subs / ~3300 msgs in Apr
* marketing@ 85 subs / ~260 msgs in Apr
* users-cn@ ~300 subs / ~260 msgs in Apr
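
The contributor counting described in the methodology can be sketched as follows. This is a minimal illustration, not the actual analysis code: the records are a hypothetical in-memory sample standing in for the MongoDB collection, and the field names (`list`, `from`, `month`) and function names are assumptions made for the example.

```python
from collections import defaultdict
from email.utils import parseaddr

# Hypothetical sample records; the real study loads archived list mail into MongoDB.
messages = [
    {"list": "dev", "from": "Alice <alice@example.org>", "month": "2013-03"},
    {"list": "dev", "from": "Bob <bob@example.com>", "month": "2013-03"},
    {"list": "dev", "from": "alice@example.org", "month": "2013-04"},
    {"list": "users", "from": "Carol <carol@example.net>", "month": "2013-04"},
    {"list": "dev", "from": "Dan <dan@example.com>", "month": "2013-04"},
]


def sender_address(header):
    """Return the lower-cased address part of a From header."""
    return parseaddr(header)[1].lower()


def monthly_contributors(msgs, list_name):
    """Map each month to the set of distinct senders on one list."""
    per_month = defaultdict(set)
    for msg in msgs:
        if msg["list"] == list_name:
            per_month[msg["month"]].add(sender_address(msg["from"]))
    return dict(per_month)


def cumulative_counts(per_month):
    """Running count of distinct senders seen up to and including each month."""
    seen = set()
    totals = {}
    for month in sorted(per_month):
        seen |= per_month[month]
        totals[month] = len(seen)
    return totals


dev = monthly_contributors(messages, "dev")
print({month: len(senders) for month, senders in dev.items()})
print(cumulative_counts(dev))
```

Keeping a set of senders per month, rather than a bare count, is what makes the accumulation possible: someone who posts in several months is only counted once in the running total.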

Contributors: The plots below show the number of contributors per month since we became an ASF project, as well as the accumulation to date. A comparison with traffic prior to joining the ASF can be seen in the previous posts. The number of monthly contributors on dev is reaching 225, while the number of monthly contributors on users is reaching 175. Most notable is that the number of contributors on the users list seems to be closing in on the number of contributors on dev. This may indicate a stabilization of the number of developers and an increase in the user base. The accumulation on both lists is now over 500. Comparing the two contributor sets, so that people active on both lists are counted once, gives an estimate of 806 for the entire CloudStack community. Of course this does not include people who may only participate on the marketing or announce lists, but those are much lower traffic lists. It also does not include participants on the Chinese users list. This will hopefully be covered in the next post. From the subscription data listed above you can also see that we have roughly a 30% activity ratio, meaning that about a third of the subscribers actually send emails to the lists. It is difficult to know whether this is a good or bad number; one would need to compare with other ASF projects.
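
As a rough sketch of the two calculations in this paragraph: the combined community size is the size of the union of the two sender sets, and the activity ratio compares monthly senders with subscribers. The sender sets below are hypothetical toy data. The ratio uses the dev and users figures quoted in this post, and combining the two lists this way is only one possible reading of how the 30% figure was obtained.

```python
# Toy sender sets (hypothetical); the real sets come from the dev and users archives.
dev_contributors = {"alice@example.org", "bob@example.com", "dan@example.com"}
users_contributors = {"alice@example.org", "carol@example.net"}

# People active on both lists are counted once in the union.
community = dev_contributors | users_contributors
overlap = dev_contributors & users_contributors
print(len(community), len(overlap))

# Assumed reading of the activity ratio: monthly senders divided by subscribers,
# summed over dev and users, with the numbers quoted in this post.
monthly_senders = {"dev": 225, "users": 175}
subscribers = {"dev": 609, "users": 782}
ratio = sum(monthly_senders.values()) / sum(subscribers.values())
print(f"{ratio:.0%}")
```

With these figures the ratio comes out just under 30%, which is consistent with the rough estimate given above.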

Companies: The plots below show the number of companies contributing on the dev and users lists, as well as the accumulation to date. As with the monthly contributor count, we are seeing faster diversification on the users list: more companies are active on it each month than on the dev list (~80 and ~60 per month respectively). The accumulation has reached ~230 on the users list and ~190 on the dev list, for a combined total of 319.
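
The company counts can be sketched in the same spirit. This is a minimal example under stated assumptions: the (month, address) pairs are hypothetical, the function names are made up for illustration, and the email domain is used as a stand-in for the company, with the limitation already noted in the methodology.

```python
from collections import defaultdict

# Hypothetical (month, sender address) pairs for a single list.
posts = [
    ("2013-03", "alice@example.org"),
    ("2013-03", "bob@example.com"),
    ("2013-04", "dan@example.com"),
    ("2013-04", "carol@example.net"),
]


def company_of(address):
    """Use the part after the last '@' as a stand-in for the company."""
    return address.rsplit("@", 1)[-1].lower()


def companies_per_month(pairs):
    """Map each month to the set of distinct sender domains."""
    per_month = defaultdict(set)
    for month, address in pairs:
        per_month[month].add(company_of(address))
    return dict(per_month)


def cumulative_companies(per_month):
    """Running count of distinct domains seen up to and including each month."""
    seen, totals = set(), {}
    for month in sorted(per_month):
        seen |= per_month[month]
        totals[month] = len(seen)
    return totals


per_month = companies_per_month(posts)
print({month: len(domains) for month, domains in per_month.items()})
print(cumulative_companies(per_month))
```

The combined total across lists is again a set union, so it is smaller than the sum of the two accumulations. Taking the approximate figures above at face value, 230 + 190 - 319 suggests that on the order of 100 domains appear on both lists.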

Commits and Marketing: Traffic on the commits list reflects the number of committers that modify the code. These committers often apply patches submitted by other contributors who do not have write access to the code. Therefore the plots should not be read as the total number of code contributors. The plot below shows the number of Committers rising to just shy of 40. The marketing list is a new list that deals with event planning. Its trend is not yet established, but the data shows between 30 and 50 contributors per month.

Social Networks: The following graphs show the social networks of the dev and users lists. They are aimed at identifying who is most central to the community. They can be used to identify great contributors who should be recognized and invited to become Apache committers. Ultimately I want to use them to build a topic-based network, so that people searching for a particular subject know who to talk to. I plan to build an interface that would use keywords to dynamically build these graphs and identify the people who contribute the most to that topic. The graphs show the networks for the last four months. Our new Apache CloudStack vice-president Chip Childers is clearly the most central node on the dev list, and Ahmad Emneina is the most central contributor on the users list. The size of each node is proportional to its centrality, and the thickness of each edge shows the strength of the connection between two nodes. Several nodes (contributors) have been filtered out to render a readable picture.
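
To make the node size and edge thickness concrete, here is a small sketch of one way such a network could be scored. It is an illustration only: the post does not say which centrality measure or graph tool was used, so weighted degree is an assumption, and the reply pairs and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (sender, replied_to) pairs extracted from threaded mail.
replies = [
    ("alice", "bob"),
    ("bob", "alice"),
    ("carol", "alice"),
    ("dan", "alice"),
    ("dan", "bob"),
]


def edge_weights(pairs):
    """Count exchanges per undirected pair; this plays the role of edge thickness."""
    weights = Counter()
    for sender, recipient in pairs:
        if sender != recipient:  # ignore self-replies
            weights[frozenset((sender, recipient))] += 1
    return weights


def weighted_degree(weights):
    """Sum of edge weights at each node; a simple centrality score (assumed measure)."""
    score = defaultdict(int)
    for pair, weight in weights.items():
        for node in pair:
            score[node] += weight
    return dict(score)


weights = edge_weights(replies)
scores = weighted_degree(weights)
most_central = max(scores, key=scores.get)
print(scores, most_central)
```

Filtering low-scoring nodes before drawing, as done for the graphs here, would amount to keeping only the entries of `scores` above some threshold.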

Finally, it is important to note that some contributors are not active on the mailing lists, even though using the lists is an ASF mantra. Specifically, we have engaged in a very active translation effort to bring CloudStack to all countries worldwide. Our translation team has 32 members as of the last count. All translation is handled via Transifex. I am also working on git analysis to show better information on commits, and I pointed out to John Jiang that he used the wrong repository in his latest study. Stay tuned.
