Data mining of public genomic repositories: harnessing off-target reads to expand microbial pathogen genomic resources
Damien Richard (UMR PHIM), Nils Poulicard (UMR PHIM)

TL;DR
This review discusses how off-target reads in publicly available genomic sequencing data can be mined to expand microbial pathogen genomic resources, broadening known genetic diversity, clarifying spatiotemporal distribution, and uncovering previously unrecognized ecological interactions.
Contribution
The review highlights recent methodological advances that enable efficient mining of petabase-scale databases for microbial pathogen research.
Findings
Expanded microbial pathogen genetic diversity
Improved understanding of pathogen spatiotemporal distribution
Uncovered previously unrecognized ecological interactions
Abstract
As sequencing technologies become more affordable and genomic databases expand continuously, the reuse of publicly available sequencing data emerges as a powerful strategy for studying microbial pathogens. Indeed, raw sequencing reads generated for the study of a given organism often contain reads originating from the associated microbiota. This review explores how such off-target reads can be detected and used for the study of microbial pathogens. We present genomic data mining as a method to identify relevant sequencing runs from petabase-scale databases, highlighting recent methodological advances that allow efficient database querying. We then briefly outline methods designed to retrieve relevant data and associated metadata, and provide an overview of common downstream analysis pipelines. We discuss how such approaches have (i) expanded the known genetic diversity of microbial…
