A resource-frugal probabilistic dictionary and applications in (meta)genomics
Camille Marchet, Antoine Limasset, Lucie Bittner, Pierre Peterlongo

TL;DR
This paper introduces a scalable, resource-frugal probabilistic dictionary for indexing billions of short genomic sequences, together with two applications in genomics and metagenomics that, according to the authors, handle problem instances on which no other known solution scales.
Contribution
The paper presents a novel, scalable probabilistic indexing structure that handles billions of elements, with two applications demonstrating its effectiveness in genomics and metagenomics.
Findings
Successfully indexes billions of genomic sequences
Enables new scalable applications in genomics and metagenomics
Solves problem instances on which, according to the authors, no other known solution scales
Abstract
Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high-performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is a task considered too expensive while being a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications developed from this structure.
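The abstract does not spell out how the indexing structure works, so the following is only a minimal sketch of the general idea behind a probabilistic dictionary, not the authors' construction. It assumes one common approach: store a short fingerprint of each key instead of the key itself, which saves memory at the cost of a small false-positive rate for keys that were never indexed. The names `fingerprint`, `ProbabilisticDict`, and `kmers` are hypothetical, and a plain Python dict stands in for whatever compact table a real implementation would use.

```python
import hashlib


def fingerprint(key: str, bits: int = 32) -> int:
    """Return a `bits`-bit fingerprint of `key` (hypothetical helper)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    # Keep only the top `bits` bits of the 64-bit digest.
    return int.from_bytes(digest, "big") >> (64 - bits)


class ProbabilisticDict:
    """Maps keys to values while storing only fingerprints, never the keys.

    Assumption for illustration: a Python dict holds the fingerprints.
    A key that was never added can still return a value if its
    fingerprint collides with an indexed one (a false positive).
    """

    def __init__(self, bits: int = 32):
        self.bits = bits
        self._table = {}

    def add(self, key: str, value) -> None:
        self._table[fingerprint(key, self.bits)] = value

    def get(self, key: str, default=None):
        return self._table.get(fingerprint(key, self.bits), default)


def kmers(seq: str, k: int) -> list:
    """Return all overlapping substrings of length k of `seq`."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Index each 4-mer of a toy sequence by its starting position.
index = ProbabilisticDict(bits=32)
for pos, kmer in enumerate(kmers("ACGTACGGTC", 4)):
    index.add(kmer, pos)
```

In this sketch, `index.get("ACGG")` looks up the position recorded for that 4-mer without the 4-mer itself ever being stored. Shortening the fingerprint (a smaller `bits`) lowers memory per entry but raises the chance that an unindexed sequence is reported as present.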
Taxonomy
Topics: Algorithms and Data Compression · Evolutionary Algorithms and Applications · Metaheuristic Optimization Algorithms Research
