Parallel and Scalable Precise Clustering for Homologous Protein Discovery
Stuart Byma, Akash Dhasade, Adrian Altenhoff, Christophe Dessimoz, James R. Larus

TL;DR
This paper introduces ClusterMerge, a parallel algorithm for precise protein clustering that significantly speeds up homologous protein discovery while maintaining high accuracy and scalability.
Contribution
The paper presents ClusterMerge, a novel parallel clustering algorithm that leverages transitive relationships for scalable and efficient homologous protein identification.
Findings
Identifies 99.8% of the similar pairs found by a full comparison
Attains 604× speedup on 768 cores
Maintains high parallel and distributed scalability
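As a rough reading of the speedup figure, parallel efficiency can be computed as speedup divided by core count. This assumes the 604× figure is measured against a single-core baseline, which this summary does not state.

```python
# Parallel efficiency = speedup / number of cores.
# Assumption: the reported speedup is relative to one core.
speedup = 604
cores = 768

efficiency = speedup / cores
print(f"Parallel efficiency: {efficiency:.1%}")
```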
Abstract
This paper presents a new, parallel implementation of clustering and demonstrates its utility in greatly speeding up the process of identifying homologous proteins. Clustering is a technique to reduce the number of comparisons needed to find similar pairs in a set of elements such as protein sequences. Precise clustering ensures that each pair of similar elements appears together in at least one cluster, so that similarities can be identified by all-to-all comparison in each cluster rather than on the full set. This paper introduces ClusterMerge, a new algorithm for precise clustering that uses transitive relationships among the elements to enable parallel and scalable implementations of this approach. We apply ClusterMerge to the important problem of finding similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of similar pairs found by a full…
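The sketch below illustrates why precise clustering saves work: once every similar pair shares at least one cluster, all-to-all comparison inside each cluster recovers the same pairs as comparing the full set. It is not ClusterMerge itself. The clusters are supplied by hand, and `toy_similar` (shared 3-character substrings) is a hypothetical stand-in for a real sequence-alignment score; all names here are illustrative.

```python
from itertools import combinations

def toy_similar(a, b, k=3):
    # Hypothetical similarity test: two sequences are "similar" if they
    # share at least one k-character substring. A real pipeline would use
    # an alignment score instead.
    kmers = lambda s: {s[i:i + k] for i in range(len(s) - k + 1)}
    return bool(kmers(a) & kmers(b))

def pairs_full(seqs, is_similar):
    # Baseline: compare every pair in the full set.
    candidates = list(combinations(sorted(set(seqs)), 2))
    found = {p for p in candidates if is_similar(*p)}
    return found, len(candidates)

def pairs_clustered(clusters, is_similar):
    # All-to-all comparison inside each cluster only. Clusters may overlap,
    # so already-compared pairs are skipped.
    seen, found = set(), set()
    for cluster in clusters:
        for pair in combinations(sorted(set(cluster)), 2):
            if pair in seen:
                continue
            seen.add(pair)
            if is_similar(*pair):
                found.add(pair)
    return found, len(seen)

seqs = ["MKTAYIA", "MKTAYLA", "GGSPLQW", "GGSPLRW", "TTTTTTT"]
# Hand-made clustering, assumed precise for this toy data.
clusters = [["MKTAYIA", "MKTAYLA"], ["GGSPLQW", "GGSPLRW"], ["TTTTTTT"]]

full_found, full_cmp = pairs_full(seqs, toy_similar)
clus_found, clus_cmp = pairs_clustered(clusters, toy_similar)
print(f"full set: {full_cmp} comparisons, {len(full_found)} similar pairs")
print(f"clustered: {clus_cmp} comparisons, {len(clus_found)} similar pairs")
```

On this toy input both approaches find the same similar pairs, while the clustered version performs far fewer comparisons. The saving depends entirely on how well the clustering groups similar elements.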
