Extreme-scale many-against-many protein similarity search
Oguz Selvitopi, Saliya Ekanayake, Giulia Guidi, Muaaz G. Awan, Georgios A. Pavlopoulos, Ariful Azad, Nikos Kyrpides, Leonid Oliker, Katherine Yelick, Aydın Buluç

TL;DR
This paper uses over 20,000 GPUs on the Summit system to run an all-vs-all protein similarity search on a dataset of 405 million proteins in under 3.5 hours, cutting the time-to-solution for many use cases from weeks to hours.
Contribution
It introduces novel matrix-based blocking techniques that enable scalable, memory-efficient all-vs-all protein similarity search on supercomputers.
Findings
Completed similarity search on 405 million proteins in under 3.5 hours
Reduced time-to-solution from weeks to hours for large datasets
Developed memory-efficient blocking techniques for distributed computation
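To make the idea of blocking concrete, here is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the k-mer based candidate filter, the function names (`kmers`, `blocks`, `blocked_candidate_pairs`), and the parameters `k`, `min_shared`, and `block_size` are all hypothetical choices made for this example. It shows only the general principle that an all-vs-all comparison can be split into blocks so that the index for one block of sequences, rather than for the whole dataset, is held in memory at a time.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def blocks(n, block_size):
    """Yield (start, stop) index ranges covering 0..n in chunks."""
    for start in range(0, n, block_size):
        yield start, min(start + block_size, n)


def blocked_candidate_pairs(seqs, k=3, min_shared=2, block_size=2):
    """Find pairs (i, j) with i < j that share >= min_shared distinct k-mers.

    Hypothetical illustration: the pairwise space is processed one
    (row block, column block) tile at a time, and only the k-mer index
    of the current column block is kept in memory.
    """
    n = len(seqs)
    pairs = {}
    for c0, c1 in blocks(n, block_size):
        # Inverted index for this column block only: k-mer -> sequence ids.
        index = defaultdict(list)
        for j in range(c0, c1):
            for km in kmers(seqs[j], k):
                index[km].append(j)
        for r0, r1 in blocks(n, block_size):
            if r0 >= c1:
                # Every i in this row block exceeds every j in the column
                # block, so the tile holds no pair with i < j.
                continue
            for i in range(r0, r1):
                counts = defaultdict(int)
                for km in kmers(seqs[i], k):
                    for j in index.get(km, ()):
                        if i < j:
                            counts[j] += 1
                for j, shared in counts.items():
                    if shared >= min_shared:
                        pairs[(i, j)] = shared
    return pairs
```

The set of pairs returned does not depend on `block_size`; a smaller block only lowers the peak size of the in-memory index at the cost of more passes over the sequences. In a distributed setting, each tile could in principle be handled independently, and the surviving candidate pairs would then be passed to an alignment step.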
Abstract
Similarity search is one of the most fundamental computations that are regularly performed on ever-increasing protein datasets. Scalability is of paramount importance for uncovering novel phenomena that occur at very large scales. We unleash the power of over 20,000 GPUs on the Summit system to perform all-vs-all protein similarity search on one of the largest publicly available datasets with 405 million proteins, in less than 3.5 hours, cutting the time-to-solution for many use cases from weeks. The variability of protein sequence lengths, as well as the sparsity of the space of pairwise comparisons, make this a challenging problem in distributed memory. Due to the need to construct and maintain a data structure holding indices to all other sequences, this application has a huge memory footprint that makes it hard to scale the problem sizes. We overcome this memory limitation by…
Taxonomy
Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Genomics and Phylogenetic Studies
