Using a Big Data Database to Identify Pathogens in Protein Data Space
Ashley Mae Conard, Stephanie Dodson, Jeremy Kepner, Darrell Ricke

TL;DR
This paper investigates leveraging big data database technologies to enhance the speed and accuracy of pathogen identification in large-scale metagenomic DNA sequences by utilizing large sparse associative array representations.
Contribution
It introduces a novel approach that employs big data databases to analyze genetic data in protein space, aiming to improve pathogen detection efficiency and accuracy.
Findings
Utilizes big data databases for large-scale genetic data analysis.
Employs sparse associative arrays to extract statistical patterns.
Aims to reduce false positives and negatives in pathogen identification.
Abstract
Current metagenomic analysis algorithms require significant computing resources, can report excessive false positives (type I errors), may miss organisms (false negatives, type II errors), or scale poorly on large datasets. This paper explores using big data database technologies to characterize very large metagenomic DNA sequences in protein space, with the ultimate goal of rapid pathogen identification in patient samples. Our approach uses the ability of a big data database to hold large sparse associative array representations of genetic data, from which statistical patterns can be extracted and used in a variety of ways to improve identification algorithms.
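To make the idea of a sparse associative array of genetic data concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the database, the word length, or the statistics used, so the k-mer length `K`, the toy protein sequences, and the helper names (`build_assoc`, `column_degrees`, `correlate`) are all assumptions made for illustration. Rows are sequence IDs, columns are protein k-mers, and only the nonzero entries are stored.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of overlapping length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_assoc(seqs, k):
    """Sparse associative array: row key = sequence ID, column keys = k-mers present."""
    return {sid: kmers(s, k) for sid, s in seqs.items()}


def column_degrees(assoc):
    """For each k-mer (column), count how many sequences (rows) contain it."""
    deg = Counter()
    for cols in assoc.values():
        deg.update(cols)
    return deg


def correlate(sample, reference):
    """Count shared k-mers per (sample ID, reference ID) pair.

    Equivalent to the nonzero entries of A_sample * A_reference^T
    when both arrays hold 0/1 values.
    """
    # Invert the reference array: k-mer -> set of reference IDs.
    index = defaultdict(set)
    for rid, cols in reference.items():
        for c in cols:
            index[c].add(rid)
    hits = Counter()
    for sid, cols in sample.items():
        for c in cols:
            for rid in index.get(c, ()):
                hits[(sid, rid)] += 1
    return hits


K = 3  # hypothetical word length, chosen only to keep the toy example small
reference = build_assoc({"refA": "MKTAYIAK", "refB": "GSHMLEDP"}, K)
sample = build_assoc({"read1": "KTAYIA", "read2": "QQQQQQ"}, K)

degrees = column_degrees(reference)
hits = correlate(sample, reference)
print(hits)
```

In this toy input, `read1` shares k-mers only with `refA`, and `read2` matches nothing. Column degrees are one example of a statistic such a representation makes cheap to compute: k-mers that occur in very many reference sequences carry little discriminating power and could be down-weighted or filtered before matching, which is one plausible way to reduce false positives.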
Taxonomy
Topics: Algorithms and Data Compression · Gene expression and cancer classification · Advanced Data Storage Technologies
