Efficient k-mer Dataset Compression Using Eulerian Covers of de Bruijn Graphs and BWT
H. Z.Q. Chen, S. Kitaev, X. Lang, A. Pyatkin, and R. Tang

TL;DR
This paper introduces MCTR, a lossless compression algorithm for k-mer datasets that computes an Eulerian cover of the dataset's de Bruijn graph and then applies the BWT. It runs in linear time and space, guarantees complete reconstruction of the k-mer multiset, and is evaluated on simulated and real genomic data.
Contribution
MCTR is the first to combine Eulerian graph covers with BWT for lossless k-mer dataset compression, providing a theoretically grounded, efficient method with proven linear time and space complexity.
Findings
MCTR guarantees lossless reconstruction of k-mer multisets.
MCTR achieves moderate compression ratios on real genomic data.
The full MCTR+BWT pipeline outperforms BWT alone in lossless compression.
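For readers unfamiliar with the BWT stage named in these findings, the following is a minimal sketch of the transform and its inverse. It is a naive textbook version that sorts all rotations, not the paper's implementation; the function names and the `$` sentinel are assumptions made for illustration, and a practical pipeline would use a suffix-array construction instead.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    # The sentinel marks the end of the text and must not occur inside it.
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Invert the naive BWT by repeatedly prepending and re-sorting columns."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original text is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


if __name__ == "__main__":
    transformed = bwt("ACGTACGGA")
    print(transformed, inverse_bwt(transformed))
```

The transform itself does not shrink the text; it groups similar contexts so that a downstream entropy or run-length coder can compress the result more effectively.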
Abstract
Transforming an input sequence into its constituent k-mers is a fundamental operation in computational genomics. To reduce storage costs associated with k-mer datasets, we introduce and formally analyze MCTR, a novel two-stage algorithm for lossless compression of the k-mer multiset. Our core method achieves a minimal text representation (W) by computing an optimal Eulerian cover (minimum string count) of the dataset's de Bruijn graph, enabled by an efficient local Eulerization technique. The resulting strings are then further compressed losslessly using the Burrows-Wheeler Transform (BWT). Leveraging de Bruijn graph properties, MCTR is proven to achieve linear time and space complexity and guarantees complete reconstruction of the original k-mer multiset, including frequencies. Using simulated and real genomic data, we evaluated MCTR's performance (list and frequency…
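The first stage described in the abstract can be illustrated with a small sketch. It builds the de Bruijn multigraph of a k-mer multiset, balances each connected component with temporary edges, walks an Eulerian circuit, and cuts the circuit at the temporary edges to obtain a set of strings whose k-mer multiset equals the input. This is a generic textbook construction (Hierholzer's algorithm plus union-find), not the paper's local Eulerization technique; all function names are hypothetical, and `k >= 2` is assumed.

```python
from collections import Counter, defaultdict


def kmer_multiset(seq, k):
    """Count every k-mer occurrence in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def eulerian_cover(kmers, k):
    """Return strings whose combined k-mer multiset equals `kmers` (assumes k >= 2)."""
    out_edges = defaultdict(list)  # (k-1)-mer -> list of (next (k-1)-mer, is_real)
    balance = Counter()            # out-degree minus in-degree per vertex
    parent = {}                    # union-find over weakly connected components

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Each k-mer occurrence is one edge from its prefix to its suffix.
    for kmer, count in kmers.items():
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].extend([(v, True)] * count)
        balance[u] += count
        balance[v] -= count
        parent[find(u)] = find(v)

    sources, sinks = defaultdict(list), defaultdict(list)
    for v, b in balance.items():
        (sources if b > 0 else sinks)[find(v)].extend([v] * abs(b))

    def circuit(start):
        """Iterative Hierholzer; yields (vertex, flag of the edge entering it)."""
        stack, path = [(start, None)], []
        while stack:
            v = stack[-1][0]
            if out_edges[v]:
                stack.append(out_edges[v].pop())
            else:
                path.append(stack.pop())
        return path[::-1]

    def spell(trail):
        return trail[0][0] + "".join(v[-1] for _, v in trail)

    strings = []
    for root in sorted({find(v) for v in balance}):
        # Temporary sink -> source edges make the component balanced.
        for t, s in zip(sinks[root], sources[root]):
            out_edges[t].append((s, False))
        path = circuit(root)
        edges = [(path[i - 1][0], path[i][0], path[i][1]) for i in range(1, len(path))]
        dummies = [i for i, e in enumerate(edges) if not e[2]]
        if dummies:  # rotate the circuit so that it ends on a temporary edge
            j = dummies[0]
            edges = edges[j + 1:] + edges[:j + 1]
        trail = []
        for u, v, real in edges:
            if real:
                trail.append((u, v))
            else:  # cut the circuit at every temporary edge
                strings.append(spell(trail))
                trail = []
        if trail:
            strings.append(spell(trail))
    return strings


if __name__ == "__main__":
    original = kmer_multiset("ACGTACGGA", 3)
    cover = eulerian_cover(original, 3)
    rebuilt = sum((kmer_multiset(s, 3) for s in cover), Counter())
    print(cover, rebuilt == original)
```

In this sketch each component yields one string per unit of positive degree imbalance (or a single string when the component is already balanced), which is the smallest number of trails that can cover it. The recovered strings need not match the input sequences; only the k-mer multiset, including frequencies, is preserved.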
