Rapid Sequence Identification of Potential Pathogens Using Techniques from Sparse Linear Algebra
Stephanie Dodson, Darrell O. Ricke, Jeremy Kepner, Nelson Chiu, and Anna Shcherbina

TL;DR
The paper introduces D4RAGenS, a fast and accurate genetic sequence identification algorithm that uses linear algebra and subsampling to handle large genomic datasets efficiently, with applications in biodefense and diagnostics.
Contribution
It presents a novel sequence identification method leveraging D4M and linear algebra, offering two modes for speed and accuracy tradeoffs, suitable for large-scale genomic data analysis.
Findings
Demonstrates high accuracy in pathogen identification
Achieves significant speed improvements over existing methods
Validated on several datasets, including three from the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest
Abstract
The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing have rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D4RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D4RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest.
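The abstract does not spell out the matrix formulation, so the following is only a minimal sketch of one common way to cast sequence matching as sparse linear algebra; it is not the authors' implementation. It assumes sequences are broken into fixed-length subsequences (k-mers), that each collection is stored as a sparse sequence-by-k-mer incidence matrix, and that the product of the sample matrix with the transposed reference matrix counts shared k-mers. The k-mer length `K`, the 50% subsampling rate, the helper names, and the synthetic data are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

K = 10  # k-mer length; an assumed value, not taken from the paper


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def incidence(seqs, k=K, keep=None):
    """Sparse incidence matrix stored as {sequence_id: {kmer: 1}}.

    If `keep` is given, only k-mers in that set are stored (subsampling).
    """
    rows = {}
    for seq_id, seq in seqs.items():
        ks = kmers(seq, k)
        if keep is not None:
            ks &= keep
        rows[seq_id] = dict.fromkeys(ks, 1)
    return rows


def matmul_transpose(a, b):
    """Sparse product A * B^T.

    Entry (i, j) is the number of k-mers shared by row i of A and row j of B.
    """
    # Index B by column so that only overlapping k-mers are ever visited.
    b_cols = defaultdict(dict)
    for j, row in b.items():
        for kmer, value in row.items():
            b_cols[kmer][j] = value
    out = defaultdict(lambda: defaultdict(int))
    for i, row in a.items():
        for kmer, value in row.items():
            for j, other in b_cols.get(kmer, {}).items():
                out[i][j] += value * other
    return {i: dict(r) for i, r in out.items()}


# Synthetic data: two random "reference genomes" and two reads cut from them.
rng = random.Random(0)


def random_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))


references = {"ref_A": random_dna(300), "ref_B": random_dna(300)}
reads = {
    "read_1": references["ref_A"][50:120],
    "read_2": references["ref_B"][200:260],
}

# Subsample: keep a random half of the reference k-mers (assumed rate).
ref_full = incidence(references)
all_ref_kmers = sorted(set().union(*(r.keys() for r in ref_full.values())))
subset = set(rng.sample(all_ref_kmers, len(all_ref_kmers) // 2))

# Shared-k-mer counts between every read and every reference.
hits = matmul_transpose(
    incidence(reads, keep=subset), incidence(references, keep=subset)
)

# Assign each read to the reference with the most shared k-mers.
best = {read: max(row, key=row.get) for read, row in hits.items()}
print(best)
```

In this sketch, subsampling shrinks both matrices before the product is taken, which reduces work at the cost of fewer matching k-mers per read. A speed and precision tradeoff like the one the abstract attributes to the Fast and Wise modes could plausibly be tuned through such a rate, but the paper's actual mode definitions are not given here.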
