dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Arnav Shah, Junzhe Li, Parsa Idehpour, Adibvafa Fallahpour, Brandon Wang, Sukjun Hwang, Bo Wang, Patrick D. Hsu, Hani Goodarzi, Albert Gu

TL;DR
dnaHNet is a scalable, tokenizer-free genomic model that adaptively compresses DNA sequences, enabling efficient, interpretable predictions and hierarchical biological insights without supervision.
Contribution
dnaHNet introduces a differentiable dynamic chunking mechanism for end-to-end genomic modeling, and it outperforms existing architectures in efficiency and scalability.
Findings
Outperforms leading architectures, including StripedHyena2, in scaling and efficiency.
Achieves over 3x inference speedup compared to Transformers.
Automatically discovers hierarchical biological structures.
Abstract
Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff in their input representation. Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling inference speedup over…
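To make the idea of adaptive chunking and its quadratic payoff concrete, here is a minimal Python sketch. It is not the dnaHNet implementation. It assumes that a chunk boundary is scored by the cosine dissimilarity between adjacent position embeddings, which the abstract does not specify. It uses a toy one-hot embedding and a hard threshold, whereas the abstract describes a differentiable mechanism. All function names (`boundary_probs`, `chunk`, `pairwise_cost_ratio`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def boundary_probs(states):
    """Assumed scoring rule: p_t = (1 - cos(h_{t-1}, h_t)) / 2.

    The first position always opens a chunk, so its score is 1.0.
    """
    probs = [1.0]
    for prev, cur in zip(states, states[1:]):
        probs.append(0.5 * (1.0 - cosine(prev, cur)))
    return probs

def chunk(seq, states, threshold=0.5):
    """Group nucleotides into variable-length chunks.

    A new chunk starts wherever the boundary score reaches the threshold.
    This hard cut is for illustration only and is not differentiable.
    """
    chunks = []
    for base, p in zip(seq, boundary_probs(states)):
        if p >= threshold or not chunks:
            chunks.append(base)
        else:
            chunks[-1] += base
    return chunks

def pairwise_cost_ratio(n_in, n_out):
    """Reduction factor for any cost that grows with the square of length."""
    return (n_in / n_out) ** 2

# Toy embedding (an assumption): one-hot vectors per nucleotide.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

seq = "AAACCGTT"
states = [ONE_HOT[b] for b in seq]
chunks = chunk(seq, states)
print(chunks)                                    # ['AAA', 'CC', 'G', 'TT']
print(pairwise_cost_ratio(len(seq), len(chunks)))  # 4.0
```

In this toy run, 8 nucleotides collapse into 4 latent tokens, so any pairwise (quadratic) cost over the compressed sequence drops by a factor of 4. Applying the same compression recursively at a second stage would compound the reduction, which is the intuition behind the quadratic FLOP savings the abstract mentions.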
