GeneZip: Region-Aware Compression for Long Context DNA Modeling
Jianan Zhao, Xixian Liu, Zhihao Zhan, Xinyu Yuan, Hongyu Guo, Jian Tang

TL;DR
GeneZip is a region-aware DNA compression framework for long-context DNA modeling. It allocates representational budget unevenly across genomic regions, which reduces redundancy and makes large-scale pretraining more efficient.
Contribution
It introduces a region-aware compression method that combines static gene-structure annotations (used only during compression training) with H-Net-style dynamic routing, and reports the best validation perplexity among encoder-based compressors.
Findings
GeneZip achieves the best validation perplexity among encoder-based compressors.
It assigns higher base-pairs-per-token (BPT), that is, stronger compression, to repetitive DNA sequences without supervision.
GeneZip enables longer-context pretraining and faster fine-tuning on DNA tasks.
Abstract
Long-context DNA models are limited by token-mixing cost and by how compression allocates representational budget across the genome. Existing approaches operate close to base-pair resolution, apply fixed downsampling, or learn content-dependent chunks without an explicit genomic budget, making long-context pretraining expensive and difficult to control. We introduce GeneZip, a region-aware DNA compression framework that combines H-Net-style dynamic routing with a Region-Aware Ratio (RAR) objective and bounded routing. GeneZip uses static gene-structure annotations during compression training to specify region-wise base-pairs-per-token (BPT) targets; at inference time, it compresses raw unseen DNA without annotations. GeneZip provides three main benefits. First, it is effective: GeneZip variants achieve the best validation PPL among encoder-based compressors, with GeneZip-70M operating…
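To make the base-pairs-per-token (BPT) quantity concrete, here is a minimal sketch of how a per-region BPT could be measured from a compressor's chunk boundaries. It is not the paper's implementation: the function name `region_bpt`, the boundary-mask representation, the region labels, and the toy values are all assumptions made for illustration. Each token is attributed to the region in which it starts.

```python
from collections import defaultdict


def region_bpt(boundaries, regions):
    """Return base-pairs-per-token for each region label.

    boundaries[i] is 1 if a new token starts at base i, else 0 (hypothetical format).
    regions[i] is the annotation label of base i (hypothetical labels).
    """
    if len(boundaries) != len(regions):
        raise ValueError("boundaries and regions must have the same length")

    bases = defaultdict(int)   # number of bases seen per region
    tokens = defaultdict(int)  # number of tokens that start in each region
    for is_start, label in zip(boundaries, regions):
        bases[label] += 1
        tokens[label] += int(is_start)

    # Regions in which no token starts are skipped to avoid division by zero.
    return {label: bases[label] / tokens[label]
            for label in bases if tokens[label] > 0}


# Toy example: 8 exon bases chunked every 2 bases, 16 intron bases chunked every 8.
regions = ["exon"] * 8 + ["intron"] * 16
boundaries = [1, 0] * 4 + [1, 0, 0, 0, 0, 0, 0, 0] * 2
result = region_bpt(boundaries, regions)
print(result)
```

In this toy input the exon stretch comes out at 2 bases per token and the intron stretch at 8, which is the kind of uneven budget that region-wise BPT targets are meant to express. How GeneZip turns such targets into a training objective is not specified in the text above, so it is left out of the sketch.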
