Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, and Volodymyr Kuleshov

TL;DR
Caduceus is a DNA language model that combines bi-directionality with reverse complementarity (RC) equivariance to model long-range genomic interactions, outperforming previous long-range models on downstream benchmarks.
Contribution
It introduces the first RC equivariant bi-directional long-range DNA language models, extending the Mamba architecture for genomic sequence modeling.
Findings
Caduceus outperforms previous long-range models on downstream benchmarks.
On a challenging variant effect prediction task, Caduceus exceeds larger models without bi-directionality or equivariance.
Caduceus demonstrates the effectiveness of bi-directional and RC equivariant modeling in genomics.
Abstract
Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range…
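To make the two properties named in the abstract concrete, here is a minimal sketch in plain Python. It is a toy illustration, not the Caduceus, BiMamba, or MambaDNA implementation: `toy_scores` is a hypothetical stand-in for a left-to-right (causal) sequence module, and `rc_equivariant_scores` is one simple way, assumed here for illustration, to make it RC equivariant by applying the same function to a sequence and to its reverse complement and then combining the two aligned outputs. Because the second pass reads the opposite strand, each position also receives information from downstream bases.

```python
# Toy illustration only; names and the combination rule are assumptions,
# not taken from the paper's code.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement (RC) of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def toy_scores(seq: str) -> list[int]:
    """Hypothetical causal module: running count of 'A' seen so far.

    It only looks at upstream positions and treats the strands
    differently, so on its own it is not RC equivariant.
    """
    scores = []
    total = 0
    for base in seq:
        total += 1 if base == "A" else 0
        scores.append(total)
    return scores


def rc_equivariant_scores(seq: str) -> list[int]:
    """Apply the same module to seq and to RC(seq), then sum.

    The RC-strand output is reversed so that position i of both
    outputs refers to the same site before they are added.
    """
    forward = toy_scores(seq)
    backward = toy_scores(reverse_complement(seq))[::-1]
    return [f + b for f, b in zip(forward, backward)]


if __name__ == "__main__":
    seq = "AACGT"
    rc = reverse_complement(seq)
    # Equivariance: scoring the RC strand gives the position-reversed scores.
    print(rc_equivariant_scores(rc) == rc_equivariant_scores(seq)[::-1])
    # The plain causal module does not have this property on this input.
    print(toy_scores(rc) == toy_scores(seq)[::-1])
```

For scalar per-position outputs, RC equivariance reduces to the check in the first `print`: feeding the reverse-complemented input yields the same scores in reversed order. A real model has channel dimensions and learned parameters, so the actual construction is more involved than this sum.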
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Genetics, Bioinformatics, and Biomedical Research
