Abstract
Clustering is one of the advanced analytical functions that has an immense potential to transform the knowledge-space when applied particularly to a data-rich domain such as computational biology. The nature of clustering that is of concern in this paper is graph-theoretic - more specifically, given an input graph G(V;E) where vertices represent elements and edges connect elements that are “related” by a certain biological property, the problem is one of identifying tightly-knit (potentially variable-sized) groups where each member of a group is highly related to most, if not all, of the other members of the same group. The computational hardness of the underlying theoretical problem necessitate use of heuristics in practice. In our previous work, we had evaluated the application of a randomized sampling-based heuristic called shingling on unweighted biological graphs. In this paper, we present a new variant of this heuristic that not only extends its application to weighted graph inputs but is better positioned to achieve qualitative gains on unweighted inputs as well. We also present parallel algorithms for this heuristic using the MapReduce paradigm. Experimental results on subsets of a medium-scale real world biological input (containing up to 10.3M vertices and 640M edges), constructed out of protein sequence data collected from metagenomic communities, demonstrate significant qualitative improvements in the reported clustering, both with and without using edge weights. Furthermore, performance studies indicate near-linear scaling on up to 4K cores of a distributed memory supercomputer.