# De-Bruijn graph partitioning for scalable and accurate DNA storage processing

**Authors:** Florestan De Moor, Olivier Boullé, Dominique Lavenier

PMC · DOI: 10.1093/bioinformatics/btaf618 · Bioinformatics · 2025-11-09

## TL;DR

This paper introduces a fast and accurate method for processing DNA storage data using de-Bruijn graph partitioning, enabling efficient reconstruction of encoded sequences.

## Contribution

A novel de-Bruijn graph partitioning method is proposed for scalable DNA storage processing, independent of sequencing technology.

## Key findings

- The method achieves high precision and recall on both synthetic and real datasets.
- Processing 89 million reads (a 10 GB FASTA file) takes less than a minute on a standard 32-core server.
- The approach does not require prior knowledge of encoded information in oligonucleotides.
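
To put the reported runtime in perspective, the following back-of-the-envelope sketch converts the figures from the abstract (89 million reads, a 10 GB file, under one minute, 32 cores) into throughput bounds. It assumes that "10 GB" means 10 × 10^9 bytes and takes the one-minute ceiling as the runtime, so the rates are lower bounds rather than measured values. The function name `throughput_bounds` is purely illustrative.

```python
# Figures quoted in the abstract; the runtime is an upper bound ("less than a minute").
READS = 89_000_000
FILE_BYTES = 10 * 10**9  # assumption: decimal gigabytes
MAX_SECONDS = 60
CORES = 32


def throughput_bounds(reads, file_bytes, seconds, cores):
    """Return lower bounds on throughput implied by an upper bound on runtime."""
    return {
        "reads_per_second": reads / seconds,
        "reads_per_second_per_core": reads / seconds / cores,
        "megabytes_per_second": file_bytes / seconds / 1e6,
        # Average record size, including FASTA header lines and newlines.
        "bytes_per_read": file_bytes / reads,
    }


bounds = throughput_bounds(READS, FILE_BYTES, MAX_SECONDS, CORES)
for name, value in bounds.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the server sustains at least about 1.48 million reads per second overall, and the average FASTA record is roughly 112 bytes, header included.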

## Abstract

DNA-based data storage offers a compelling solution for long-term, high-density archiving. In this framework, accurately reconstructing high-quality encoded sequences after sequencing is critical, as it directly impacts the design of error-correcting codes optimized for DNA storage. Furthermore, efficient and scalable processing is essential to manage the large volumes of data expected in such applications.

We introduce a novel method based on de-Bruijn graph partitioning, enabling fast and accurate processing of sequencing data regardless of the underlying sequencing technology and without requiring prior knowledge of the information encoded in the oligonucleotides. Evaluated on both synthetic and real datasets, the method achieves excellent precision and recall. It is implemented in C++ within the software ConCluD and optimized for multi-core servers. Our experiments show that a dataset of 89 million reads, corresponding to a 10 GB FASTA file, can be fully processed in less than a minute on a standard 32-core server.
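
For readers unfamiliar with the underlying data structure, here is a minimal sketch of the general idea of building a de-Bruijn graph from reads and splitting it into parts. It is not ConCluD's algorithm, which this summary does not describe. The sketch assumes a simple abundance filter on k-mers and uses weakly connected components as a stand-in partitioning criterion. All function names, the value of `k`, the `min_count` threshold, and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_graph(kmer_counts, min_count):
    """Build a de-Bruijn graph: nodes are (k-1)-mers, and each k-mer seen
    at least min_count times is an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for kmer, count in kmer_counts.items():
        if count >= min_count:  # drop rare k-mers, likely sequencing errors
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def partition(graph):
    """Split the graph into weakly connected components with union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, targets in graph.items():
        for target in targets:
            root_a, root_b = find(source), find(target)
            if root_a != root_b:
                parent[root_a] = root_b

    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)
    return list(components.values())


# Toy input: three copies each of two hypothetical oligos, plus one erroneous read.
reads = ["ACGTAC"] * 3 + ["TTGGCC"] * 3 + ["ACGAAC"]
components = partition(build_graph(count_kmers(reads, k=4), min_count=2))
print(len(components), "components")
```

In this toy case the k-mers unique to the erroneous read fall below the abundance threshold, so the graph separates into one component per original oligo. Real data would need a larger `k` and a more careful treatment of errors and shared subsequences than this sketch provides.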

The ConCluD software and the scripts to reproduce the experiments from this paper are available at https://gitlab.inria.fr/pim/org.pim.dnarxiv under the GNU AGPLv3 licence. An archival snapshot of the repository is also provided at https://doi.org/10.5281/zenodo.17160067.

## Full-text entities

- **Diseases:** CRC (MESH:D015179)
- **Chemicals:** oligonucleotide (MESH:D009841), DBGPS (-)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12619639/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12619639/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12619639/full.md

---
Source: https://tomesphere.com/paper/PMC12619639