Using a Big Data Database to Identify Pathogens in Protein Data Space

Ashley Mae Conard; Stephanie Dodson; Jeremy Kepner; Darrell Ricke

arXiv:1501.05546·cs.DB·January 23, 2015

Open Access

TL;DR

This paper investigates leveraging big data database technologies to enhance the speed and accuracy of pathogen identification in large-scale metagenomic DNA sequences by utilizing large sparse associative array representations.

Contribution

It introduces a novel approach that employs big data databases to analyze genetic data in protein space, aiming to improve pathogen detection efficiency and accuracy.

Findings

1. Utilizes big data databases for large-scale genetic data analysis.

2. Employs sparse associative arrays to extract statistical patterns.

3. Aims to reduce false positives and negatives in pathogen identification.

Abstract

Current metagenomic analysis algorithms require significant computing resources, can report excessive false positives (type I errors), may miss organisms (false negatives, type II errors), or scale poorly on large datasets. This paper explores using big data database technologies to characterize very large metagenomic DNA sequences in protein space, with the ultimate goal of rapid pathogen identification in patient samples. Our approach uses the ability of big data databases to hold large sparse associative array representations of genetic data, from which statistical patterns can be extracted and used in a variety of ways to improve identification algorithms.
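
To make the idea of a sparse associative array over protein space more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a single reading frame, a short protein k-mer length of 3, and an in-memory dict of dicts standing in for a database table; the abstract does not specify any of these choices, and every function name here (`translate`, `protein_kmers`, `build_assoc_array`, `column_degrees`) is hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA to protein in reading frame 0 (assumption: one frame only)."""
    dna = dna.upper()
    # Unknown or ambiguous codons become 'X'; a trailing partial codon is dropped.
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def protein_kmers(protein, k):
    """Return every overlapping length-k substring of a protein string."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


def build_assoc_array(sequences, k=3):
    """Build a sparse associative array: row = sequence ID, column = protein k-mer.

    Only nonzero entries are stored, which is what makes the representation sparse.
    """
    array = defaultdict(dict)
    for seq_id, dna in sequences.items():
        for kmer in protein_kmers(translate(dna), k):
            if "*" not in kmer:  # skip k-mers that span a stop codon
                array[seq_id][kmer] = 1
    return dict(array)


def column_degrees(array):
    """Count how many sequences contain each k-mer (a simple statistical pattern)."""
    degrees = defaultdict(int)
    for row in array.values():
        for kmer in row:
            degrees[kmer] += 1
    return dict(degrees)


# Toy input (made-up reads, not real data).
reads = {"read1": "ATGGCCAAAGGT", "read2": "ATGGCCAAATTT"}
assoc = build_assoc_array(reads, k=3)
degrees = column_degrees(assoc)
```

In this toy case the two reads share one protein k-mer, so its column degree is 2 while the others are 1. Degree counts of this kind are one example of a statistic that could help separate common, uninformative k-mers from rare, discriminating ones.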

Taxonomy

Topics: Algorithms and Data Compression · Gene expression and cancer classification · Advanced Data Storage Technologies