Though it is one of the most significant or rather say it, one of the most appealing problem in the domain of Network Science, but still it suffers from the most primitive flaw – subjectivity. Yes, I am talking about the problem of Community Detection or what some would like to call Clustering. Why did I use the word “subjectivity” ? Because a lot many definitions exist for how a cluster should be? Adding to the problem, there exist various evaluation metric pertaining to one or multiple features of this “definition”. The problem gets worse as usually there is limited or absolutely no ground truth for most of the social network. So, to summarize the problem of finding “optimal” clusters suffers has the following road-blocks:
- No single definition of “optimal”.
- Different metrics exist pertaining to different definitions.
- Algorithms lacking flexibility – their goal to optimize just one metric.
- No ground truth. Subjectivity can exist even in different versions of ground truth.
So what is the solution? I guess, answers to all of these questions was presented in the paper “Deﬁning and Evaluating Network Communities based on Ground-truth” by J.Yang and J.Leskovec published in ICDM 2012. I have been recently involved in quite a few discussions over the evaluation of community detection, how good are the clusterings obtained by the algorithm? And, definitely there was no clear solution, until I got my eyes on this paper, which was truly like nailing the jelly to the wall (and a reliable citation source :p)
The paper talks about comparison of 13 metrics, which they come up with after seeing the commonly used definitions, and for each of the metric, compared how they performed with respect to ground truth.
What were the metrics ?
The 13 metrics comes from 4 classes namely:
- Metric based on internal connectivity
- Metric based on external connectivity
- Metric based on both internal and external connectivity
- Metric based on network model
Metrics based on internal connectivity include internal density, no. of edges inside the cluster, avg degree , fraction over median degree and triangle participation ratio.
Whereas those based on external connectivity include expansion and cut ratio.
Those combining internal and external connectivity are Conductance, Normalized Cut, Maximum ODF, Average ODF, Flake ODF. and the one that is based on network model is Modularity.
How were these metrics evaluated?
The metrics were mainly evaluated on the basis of how they kept up to the four goodness measures defined below with respect to the ground truth community structure:
- Separability – “A good community should be well separated from rest of the network”
- Density – “Good communities are well-connected”
- Cohesiveness – “Good community should be well connected internally”
- Clustering Coefficient – “In a good community structure, nodes in a graph should cluster together to a high degree”
Besides this, metrics were also evaluated on how they responded to the change in the ground truth community structure. Ideally they should not change much with little changes in the community structure, but however they should change drastically with high intensity changes. This test was conducted with the help of Z-Score and 4 models of noisening the community.
The results (as posted in the paper) indicated that generally conductance and triad participation ratio performed better for both of the tests.
The paper seemed a good read and you can also download it from Arxiv.
Besides this, it will be interesting to find out what algorithms perform better with comparison to ground truth. The measures of comparison could be F-Score. But as one of the colleague says it should be about measuring “oranges to oranges” and “apples to apples”… and comes up with a metric poised in the paper “Approximation Clustering without the Approximation” by M.F Balcan, A. Blum and A. Gupta.
The metric presented is as follows:
Suppose there are two clusterings C and C’ (where one of them is the ground truth clustering), we should aim to minimize the distance between these 2 clusterings denoted by dist(C,C’). dist(C,C’) is nothing but the fraction of points on which they disagree when matched from C to C’.
The question is to find a permutation of k-communities in C’ which when matched to k-communities in C minimize the distance between the 2 k-clusterings.
However, the metric on the first look seems like computationally intensive for graphs with high number of communities (say 30,000 communities would mean computing distance 30000! ). The problem looks like finding min-weight perfect matching in a bipartite graph.
Just a disclaimer, All views are my personal opinion (does represent group’s view) and I am really sorry if it tend to hurt someone or are in disagreement with some of the more intellectual minds out there. All results however are the intellectual property of the authors referenced in the post. I do not seek to take credit for any of it.
(A curious mind)