Scalable Protein Sequence Similarity Search using Locality-Sensitive Hashing and MapReduce
Freddie Sunarso, Srikumar Venugopal, Federico Lauro

TL;DR
This paper introduces ScalLoPS, a scalable protein sequence similarity search tool using Locality-Sensitive Hashing and MapReduce, enabling efficient analysis of large metagenomic datasets on cloud platforms.
Contribution
It presents a novel scalable approach combining LSH and MapReduce for protein sequence similarity search, suitable for large metagenomic datasets.
Findings
ScalLoPS approximates BLAST quality in sequence similarity results.
It significantly improves scalability over traditional methods.
The approach is effective on cloud computing resources.
Abstract
Metagenomics is the study of environments through genetic sampling of their microbiota. Metagenomic studies produce large datasets that are estimated to grow at a faster rate than the available computational capacity. A key step in the study of metagenome data is sequence similarity searching, which is computationally intensive over large datasets. Tools such as BLAST require large dedicated computing infrastructure to perform such analysis, and that infrastructure may not be available to every researcher. In this paper, we propose a novel approach called ScalLoPS that performs searching on protein sequence datasets using Locality-Sensitive Hashing (LSH), implemented on the MapReduce distributed framework. ScalLoPS is designed to scale across computing resources sourced from cloud computing providers. We present the design and implementation of ScalLoPS followed by evaluation with datasets…
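To make the combination of LSH and MapReduce concrete, the following is a minimal single-process sketch in Python. It is not the ScalLoPS implementation: the abstract does not specify the hash family, so this sketch assumes MinHash over overlapping k-mers with banding, and it simulates the map, shuffle, and reduce phases with an in-memory dictionary. All function names and parameters (`kmers`, `minhash_signature`, `map_phase`, `reduce_phase`, `candidate_pairs`, `bands`, `rows`, `k`) are hypothetical and chosen for illustration only.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(token, seed):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, num_hashes=32, k=3):
    """MinHash signature: the minimum hash of the k-mer set under each seed."""
    shingles = kmers(seq, k)
    if not shingles:
        raise ValueError("sequence is shorter than k")
    return tuple(min(_hash(s, seed) for s in shingles) for seed in range(num_hashes))


def map_phase(seq_id, seq, bands=8, rows=4):
    """Map: emit one (band index, band values) key per band, with the sequence id."""
    sig = minhash_signature(seq, bands * rows)
    for b in range(bands):
        yield (b, sig[b * rows:(b + 1) * rows]), seq_id


def reduce_phase(key, seq_ids):
    """Reduce: every pair of ids sharing a bucket is a candidate match."""
    for a, b in combinations(sorted(set(seq_ids)), 2):
        yield a, b


def candidate_pairs(sequences, bands=8, rows=4):
    """Run map, an in-memory shuffle (group by key), then reduce."""
    buckets = defaultdict(list)
    for seq_id, seq in sequences.items():
        for key, value in map_phase(seq_id, seq, bands, rows):
            buckets[key].append(value)
    pairs = set()
    for key, ids in buckets.items():
        pairs.update(reduce_phase(key, ids))
    return pairs


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    seqs = {
        "query": "MKTAYIAKQRQISFVKSHFSRQ",
        "ref_same": "MKTAYIAKQRQISFVKSHFSRQ",
        "ref_other": "GGGGGGGGGGWWWWWWWWWW",
    }
    print(candidate_pairs(seqs))
```

In this sketch, sequences whose signatures agree on at least one band land in the same bucket and become candidate pairs. A real pipeline would then score those candidates with an alignment or similarity measure, and would run the map and reduce functions on a distributed framework rather than in one process.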
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Genomics and Phylogenetic Studies · Algorithms and Data Compression
