A memory-efficient data structure representing exact-match overlap graphs with application for next generation DNA assembly
Hieu Dinh, Sanguthevar Rajasekaran

TL;DR
This paper introduces a memory-efficient data structure for representing exact-match overlap graphs, enabling large-scale DNA assembly by significantly reducing storage requirements and maintaining efficient access times.
Contribution
The paper presents a novel compact data structure for exact-match overlap graphs that drastically reduces memory usage while supporting efficient edge access, facilitating large-scale DNA assembly.
Findings
Memory usage is bounded by (2λ - 1)(2⌈log n⌉ + ⌈log λ⌉)n bits.
Edge access operation runs in O(log λ) time.
Construction and storage are linear in time and memory, enabling handling of billions of strings.
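To give a feel for the size of the memory bound, the short sketch below evaluates (2λ - 1)(2⌈log n⌉ + ⌈log λ⌉)n for concrete inputs. It assumes the logarithms are base 2, which is natural for a bound in bits but is not stated on this page. The function name and the sample values n = 1,000,000 and λ = 5 are illustrative only and do not come from the paper.

```python
def overlap_graph_bits_bound(n: int, lam: int) -> int:
    """Evaluate (2*lam - 1) * (2*ceil(log2 n) + ceil(log2 lam)) * n, in bits.

    Assumption: logarithms are base 2. For a positive integer m,
    ceil(log2(m)) equals (m - 1).bit_length(), which avoids float rounding.
    """
    if n < 1 or lam < 1:
        raise ValueError("n and lam must be positive integers")
    ceil_log_n = (n - 1).bit_length()
    ceil_log_lam = (lam - 1).bit_length()
    return (2 * lam - 1) * (2 * ceil_log_n + ceil_log_lam) * n


# Illustrative values only (not taken from the paper).
bits = overlap_graph_bits_bound(1_000_000, 5)
print(bits)        # 387000000 bits
print(bits // 8)   # 48375000 bytes, roughly 48 MB
```

For a fixed threshold λ, the bound grows as n log n in the number of strings, which is why it stays manageable even for very large inputs.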
Abstract
An exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω = ℓ - |ov_max(x, y)| if and only if ω ≤ λ, where |ov_max(x, y)| is the length of ov_max(x, y) and λ is a given threshold. In this paper, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ - 1)(2⌈log n⌉ + ⌈log λ⌉)n bits with a guarantee that the basic operation of accessing an edge takes O(log λ) time. Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest super string problem where the number of strings n ranges from a couple of thousands to a couple of billions, the length ℓ of the strings…
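The graph definition can be made concrete with a small brute-force sketch. It assumes the usual reading of the definition: the overlap of an ordered pair (x, y) is the longest proper suffix of x that equals a prefix of y, the edge weight is the string length minus that overlap length, and an edge exists only when the weight is at most the threshold λ. This is a naive quadratic illustration of the definition, not the compact data structure the paper proposes. The function name and the sample strings are hypothetical.

```python
from typing import Dict, List, Tuple


def naive_overlap_graph(strings: List[str], lam: int) -> Dict[Tuple[int, int], int]:
    """Build {(i, j): weight} for all ordered pairs i != j with weight <= lam.

    Assumptions (not confirmed by this page): all strings have equal length,
    and the overlap is the longest proper suffix of strings[i] that is a
    prefix of strings[j].
    """
    if not strings:
        return {}
    ell = len(strings[0])
    if any(len(s) != ell for s in strings):
        raise ValueError("all strings must have the same length")

    edges: Dict[Tuple[int, int], int] = {}
    min_overlap = max(ell - lam, 1)  # weight <= lam means overlap >= ell - lam
    for i, x in enumerate(strings):
        for j, y in enumerate(strings):
            if i == j:
                continue
            # Try the longest overlap first so the first hit is the maximum.
            for k in range(ell - 1, min_overlap - 1, -1):
                if x[ell - k:] == y[:k]:
                    edges[(i, j)] = ell - k
                    break
    return edges


reads = ["ACGTA", "CGTAC", "GTACG"]  # hypothetical reads, length 5
print(naive_overlap_graph(reads, lam=2))
# {(0, 1): 1, (0, 2): 2, (1, 2): 1, (2, 0): 2}
```

Each vertex has at most 2λ incident edges under this definition (at most λ distinct weights in each direction per overlap length is what the compact representation exploits), whereas the brute-force version above compares every pair and so does not scale to billions of strings.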
Taxonomy
Topics: DNA and Biological Computing · Advanced biosensing and bioanalysis techniques · Genomics and Phylogenetic Studies
