Apache is about building communities of developers and users around open source software. As such, these communities can be analyzed with social network analysis tools to identify communication patterns (communication networks), find sub-communities within groups, and surface the most influential nodes. This type of analysis can be done over time using data from the Apache mailing lists.
Since I am a new Apache committer on CloudStack, I wanted to have a look at the health of our community and thought a social network analysis (SNA) would do it. A little googling led me to this very nice research paper on SNA of the R mailing lists. I have not done all the analysis mentioned in the paper, especially the content-based analysis, but I wanted to post my early results.
Methodology: To get the graphs I grabbed the email archives from Apache. I used Python to load the mbox files into single Mongo collections. I cleaned the data to avoid duplicate senders and to remove JIRA and Review Board entries. Then with a little bit of PyMongo I made the queries and built the graph with NetworkX. I finished with the graph visualization and calculations in Gephi. Since there are thousands of emails and threads, there is still some work to do to pre-process the data, avoid duplicates and match individuals to multiple email addresses.
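To make the pipeline above a little more concrete, here is a minimal sketch of the "emails to edges" step. It is not the script I used: it skips Mongo entirely and reads message objects directly, it assumes an edge means "one person replied to another" (matched through the In-Reply-To header), and the BOT_MARKERS filter and the function names are hypothetical.

```python
import email
from email.utils import parseaddr

# Hypothetical filter: substrings that mark automated senders (JIRA, Review Board).
BOT_MARKERS = ("jira@", "reviewboard", "noreply")


def sender_of(msg):
    """Return the lower-cased address from the From header, or None for bots."""
    _, addr = parseaddr(str(msg.get("From", "")))
    addr = addr.lower()
    if not addr or any(marker in addr for marker in BOT_MARKERS):
        return None
    return addr


def build_reply_edges(messages):
    """Return a set of undirected edges (a, b), sorted so that a < b,
    one per pair of people where one replied to the other."""
    messages = list(messages)

    # First pass: remember who wrote each message.
    author_by_id = {}
    for msg in messages:
        sender = sender_of(msg)
        msg_id = (msg.get("Message-ID") or "").strip()
        if sender and msg_id:
            author_by_id[msg_id] = sender

    # Second pass: connect the sender of each reply to the parent's author.
    edges = set()
    for msg in messages:
        sender = sender_of(msg)
        parent = author_by_id.get((msg.get("In-Reply-To") or "").strip())
        if sender and parent and sender != parent:
            edges.add(tuple(sorted((sender, parent))))
    return edges
```

With a real archive, the messages could come from Python's standard mailbox.mbox(path), which iterates over message objects, and the resulting edge set could be handed to a NetworkX graph with add_edges_from. Matching one person to several addresses is not handled here.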
Using Gephi, I manipulated the graphs. I computed the degree of each node (i.e. the number of direct connections to other nodes) and the betweenness centrality (i.e. a measure of how often a node lies on the shortest path between two other nodes; in other words, is a node the best "proxy" between two other nodes?). I then partitioned the graphs with a color code, trying to identify sub-communities. Finally, for clarity, I filtered nodes by degree. In CloudStack, filtering is especially important since the list has grown quite large of late (this may actually be an indirect sign that it is time to split the dev list).
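For readers who want to see what these measures actually compute, here is a small plain-Python sketch. I did the real calculations in Gephi, so this is only an illustration: the function names are mine, the graph is a dict of neighbor sets, and the betweenness function is Brandes' algorithm for an unweighted, undirected graph, returning unnormalized scores.

```python
from collections import deque


def adjacency(edges):
    """Build a dict mapping each node to the set of its neighbors."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj):
    """Number of direct connections of each node."""
    return {node: len(neigh) for node, neigh in adj.items()}


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording each node's predecessors on those paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


def filter_by_degree(adj, min_degree):
    """Keep only nodes with at least min_degree neighbors in the full graph."""
    keep = {v for v, neigh in adj.items() if len(neigh) >= min_degree}
    return {v: adj[v] & keep for v in keep}
```

In a small star-shaped graph where "b" talks to "a", "c" and "d", every shortest path between the others goes through "b", so "b" gets all the betweenness and the three others get none. NetworkX users can get the same measure from its betweenness_centrality function.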
The graph of the cloudstack-dev mailing list can be seen below:
What stands out right away are the largest nodes, which are the most influential nodes according to betweenness centrality. Chip, David, Edison, Chiradeep, Hugo, Wido and Alex are all members of the PMC and exhibit a high centrality. Prasanna and Rohit also exhibit a high centrality but are not currently in the PMC. Also of interest is that this graph covers the period since CloudStack joined Apache in April 2012, so we can identify contributors who are not active currently but once were, and thus are still part of the overall communication network. The color code highlights communities within the community. There seem to be 4 to 5 sub-communities (green, blue, red, cyan, yellow); more investigation is necessary to give interesting meanings to these sub-communities. You will also notice that the edges all have the same thickness. This means that they have the same weight. Once two people exchange an email, an edge is drawn between the two nodes. If they communicate again, the edge is not modified. I will add edge weighting in a future study; this will show us "pathways" between community members and will also affect the influence of the nodes.
Update January 22nd: I added weights to the edges. In plain English, this means that every time two people communicated in a thread I increased their connectedness by 1. The graph below shows edges with different thicknesses. Nodes and edges were filtered to highlight the strongest connections. This clearly shows the "PMC" of ACS.
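As a rough illustration of the weighting, here is a minimal sketch. It assumes one particular reading of the rule above: each thread is reduced to its set of participants, and every pair of people who appear in the same thread gets +1. The function names and the sample participants are hypothetical.

```python
from collections import Counter
from itertools import combinations


def weighted_edges(threads):
    """Count, for every pair of people, the number of threads they share.

    `threads` is an iterable of participant lists, one list per thread.
    Repeated posts by the same person in a thread count only once.
    """
    weights = Counter()
    for participants in threads:
        for a, b in combinations(sorted(set(participants)), 2):
            weights[(a, b)] += 1
    return weights


def strongest(weights, min_weight):
    """Keep only the edges whose weight reaches min_weight."""
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Three made-up threads with made-up participants.
threads = [
    ["alice", "bob", "alice"],
    ["bob", "alice"],
    ["carol", "alice"],
]
weights = weighted_edges(threads)
print(strongest(weights, 2))
```

Raising min_weight plays the same role as the edge filter in the graph: weak, one-off exchanges drop out and only the pairs who talk repeatedly remain.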
The graph of the cloudstack users mailing list can be seen below:
What stands out the most in this graph is that some of the PMC members are still influential (Chiradeep, David, Alex and Edison for instance). But new influential nodes have appeared, most notably mcirauqui, geoff.higginbottom and ahmad.emneina. Chip Childers is still present but his influence in this users community is much smaller. Based on this I am ready to campaign for mcirauqui and geoff to become committers, as they are clear contributors to the CloudStack users community :)
For comparison I checked the HDFS dev mailing list (note that this is fairly restrictive since Hadoop is a very large ecosystem with many mailing lists), followed the same process and obtained the following graph. Maybe the HDFS community can help me analyze it and see if this gives the right picture of their dev community :)
I plan to do more work on this: cleaning the dataset a bit further, studying the community partitioning, and especially building content-based graphs. These will allow us to identify the communication network on a particular topic. Say you want to learn about SDN support in CloudStack: we could generate the graph and see who are the most "influential" nodes about SDN in CloudStack.
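I have not built this yet, but the simplest version of a content-based graph would start by keeping only the emails on a given topic and then running the same graph construction on that subset. A minimal sketch of such a filter, assuming the topic can be spotted from the Subject header alone (the function name and the sample subjects are hypothetical):

```python
import email
import re


def messages_about(messages, keyword):
    """Yield the messages whose Subject contains `keyword` as a whole word,
    ignoring case."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for msg in messages:
        if pattern.search(str(msg.get("Subject", ""))):
            yield msg


# Three made-up messages; only the first two mention SDN in the subject.
raw = [
    "From: a@example.org\nSubject: [DISCUSS] SDN support\n\nbody",
    "From: b@example.org\nSubject: Re: sdn controller plugin\n\nbody",
    "From: c@example.org\nSubject: Release vote\n\nbody",
]
sample = [email.message_from_string(r) for r in raw]
sdn_only = list(messages_about(sample, "SDN"))
print(len(sdn_only))
```

A real content-based analysis would also need to look at the message bodies, which is where most of the remaining work lies.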