Hecate: A Modular Genomic Compressor
Kamila Szewczyk, Sven Rahmann

TL;DR
Hecate is a modular, lossless genomic compressor for FASTA/FASTQ data. It treats compression as a conditional coding problem over coupled data streams, and its authors report better compression vs. speed trade-offs than existing tools.
Contribution
Hecate introduces a novel modular framework with multiple codecs and conditional coding strategies, achieving superior speed and compression efficiency for genomic data.
Findings
Offers the best compression vs. speed trade-offs among the state-of-the-art tools benchmarked
Provides exact random-access slicing and a referential mode
Achieves 2 to 10 times faster compression at similar ratios
Abstract
We present Hecate, a modular lossless genomic compression framework. It is designed around uncommon but practical source-coding choices. Unlike many single-method compressors, Hecate treats compression as a conditional coding problem over coupled FASTA/FASTQ streams (control, headers, nucleotides, case, quality, extras). It uses per-stream codecs under a shared indexed block container. Codecs include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler pipeline with custom arithmetic coding, and a blockwise Markov mixture coder with explicit model-competition signaling. This architecture yields high throughput, exact random-access slicing, and referential mode through streamwise binary differencing. In a comprehensive benchmark suite, Hecate provides the best compression vs. speed trade-offs against state-of-the-art…
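To make the idea of alphabet-aware packing with an explicit side channel concrete, here is a minimal sketch. It is not Hecate's actual codec or container format: the function names `pack` and `unpack`, the 2-bit layout, and the `(position, symbol)` exception list are all assumptions chosen for illustration. It also assumes upper-case input, since the abstract describes case as a separate stream.

```python
# Hypothetical sketch: 2-bit nucleotide packing with an exception side channel.
# None of these names or layouts are taken from Hecate itself.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str):
    """Pack an upper-case sequence into 2 bits per symbol.

    Returns (length, packed_bytes, exceptions), where exceptions lists
    (position, symbol) for every residue outside the A/C/G/T alphabet.
    """
    packed = bytearray((len(seq) + 3) // 4)  # four symbols per byte
    exceptions = []
    for i, ch in enumerate(seq):
        code = CODE.get(ch)
        if code is None:
            # Out-of-alphabet residue: record it in the side channel and
            # store a placeholder code that decoding will overwrite.
            exceptions.append((i, ch))
            code = 0
        packed[i // 4] |= code << (2 * (i % 4))
    return len(seq), bytes(packed), exceptions


def unpack(length: int, packed: bytes, exceptions) -> str:
    """Invert pack(): decode the 2-bit codes, then restore the exceptions."""
    out = [BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for pos, ch in exceptions:
        out[pos] = ch
    return "".join(out)


if __name__ == "__main__":
    n, data, side = pack("ACGTNNACGRT")
    print(len(data), side)
    print(unpack(n, data, side))
```

The design point this illustrates is that rare symbols such as `N` or IUPAC ambiguity codes do not force a wider per-symbol code on the common case. They are stored separately and patched back in during decoding, so the round trip stays lossless.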
Taxonomy
Topics: Algorithms and Data Compression · Genome Rearrangement Algorithms · Genomics and Phylogenetic Studies
