HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Callum Birch-Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli, Yoshua Bengio, Stefano Ermon, Stephen A. Baccus, Chris Ré

TL;DR
HyenaDNA is a genomic foundation model built on Hyena, a sub-quadratic architecture based on implicit convolutions. It processes DNA sequences of up to 1 million tokens at single nucleotide resolution, enabling long-range genomic analysis.
Contribution
This work presents HyenaDNA, the first large-scale genomic model with up to 1 million token context length at nucleotide resolution, surpassing previous Transformer models in speed and context length.
Findings
Achieves state-of-the-art results on multiple genomic benchmarks.
Scales sub-quadratically, training up to 160x faster than Transformers.
Enables in-context learning in genomics.
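The sub-quadratic scaling claim can be illustrated with a minimal sketch of a long causal convolution evaluated by FFT. This is not the HyenaDNA implementation: the function names are hypothetical, the filter here is random rather than implicitly parameterized, and NumPy is assumed to be installed. It only shows that a convolution whose filter is as long as the sequence can be computed in O(L log L) rather than O(L^2).

```python
import numpy as np

def causal_conv_direct(u, k):
    """O(L^2) reference: y[t] = sum over s <= t of k[s] * u[t - s]."""
    L = len(u)
    y = np.zeros(L)
    for t in range(L):
        for s in range(t + 1):
            y[t] += k[s] * u[t - s]
    return y

def causal_conv_fft(u, k):
    """O(L log L): zero-pad to 2L so the circular convolution
    computed by the FFT equals the linear (causal) convolution."""
    L = len(u)
    n = 2 * L
    spectrum = np.fft.rfft(u, n=n) * np.fft.rfft(k, n=n)
    return np.fft.irfft(spectrum, n=n)[:L]

# Illustrative random input and filter, both of sequence length L.
rng = np.random.default_rng(0)
u = rng.standard_normal(256)
k = rng.standard_normal(256)

# Both routes give the same output up to floating-point error.
assert np.allclose(causal_conv_direct(u, k), causal_conv_fft(u, k))
```

The direct loop touches every (t, s) pair, while the FFT route needs only three transforms of length 2L, which is where the gap to quadratic attention opens up at long context.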
Abstract
Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions was…
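The two limitations named in the abstract can be made concrete with a small sketch. The helper names, the toy sequences, and the non-overlapping k-mer scheme are illustrative assumptions, not the tokenizers used by any particular model; the genome length of roughly 3.1 billion base pairs is an approximate figure supplied here, not taken from the abstract.

```python
HUMAN_GENOME_BP = 3.1e9  # assumption: approximate haploid human genome length

def kmer_tokens(seq, k):
    """Split seq into non-overlapping k-mers (a trailing remainder stays shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

def nucleotide_tokens(seq):
    """One token per nucleotide (single nucleotide resolution)."""
    return list(seq)

def changed_positions(a, b):
    """Indices at which two equal-length token lists differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

def context_fraction_percent(context_len):
    """Share of the genome covered by one context window, in percent."""
    return 100.0 * context_len / HUMAN_GENOME_BP

ref = "ACGTACGTACGT"
alt = "ACGTACCTACGT"  # hypothetical SNP: G -> C at nucleotide index 6

# With 6-mers the SNP replaces a whole token; its position inside the token is implicit.
print(changed_positions(kmer_tokens(ref, 6), kmer_tokens(alt, 6)))        # [1]
# With one token per nucleotide exactly the mutated position changes.
print(changed_positions(nucleotide_tokens(ref), nucleotide_tokens(alt)))  # [6]

# A 4k-token window covers well under 0.001% of the genome.
print(context_fraction_percent(4096) < 0.001)  # True
```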
Code & Models
- 🤗 LongSafari/hyenadna-tiny-1k-seqlen · model · 118 dl · ♡ 6
- 🤗 LongSafari/hyenadna-medium-160k-seqlen · model · 81 dl · ♡ 2
- 🤗 LongSafari/hyenadna-large-1m-seqlen · model · 42 dl · ♡ 30
- 🤗 LongSafari/hyenadna-medium-450k-seqlen · model · 24 dl · ♡ 7
- 🤗 LongSafari/hyenadna-small-32k-seqlen · model · 98 dl · ♡ 1
- 🤗 LongSafari/hyenadna-tiny-1k-seqlen-d256 · model · 3 dl · ♡ 1
- 🤗 LongSafari/hyenadna-tiny-16k-seqlen-d128 · model · 13 dl
- 🤗 LongSafari/hyenadna-small-32k-seqlen-hf · model · 18k dl · ♡ 2
- 🤗 LongSafari/hyenadna-medium-160k-seqlen-hf · model · 1.5k dl · ♡ 4
- 🤗 LongSafari/hyenadna-medium-450k-seqlen-hf · model · 683 dl · ♡ 2
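A hedged sketch of how one of the listed checkpoints might be loaded with the Hugging Face `transformers` library. It assumes that the repositories whose names end in `-hf` ship custom modeling code usable through the Auto classes with `trust_remote_code=True`, and that the other repositories target the authors' own codebase; neither assumption is stated on this page. `hf_checkpoints` and `load_hyenadna` are hypothetical helper names, and nothing is downloaded unless `load_hyenadna` is called.

```python
# Repository IDs copied from the list above.
CHECKPOINTS = [
    "LongSafari/hyenadna-tiny-1k-seqlen",
    "LongSafari/hyenadna-large-1m-seqlen",
    "LongSafari/hyenadna-small-32k-seqlen-hf",
    "LongSafari/hyenadna-medium-160k-seqlen-hf",
    "LongSafari/hyenadna-medium-450k-seqlen-hf",
]

def hf_checkpoints(repo_ids):
    """Keep only repositories whose name carries the '-hf' suffix."""
    return [r for r in repo_ids if r.endswith("-hf")]

def load_hyenadna(repo_id):
    """Load tokenizer and model for an '-hf' checkpoint.

    Assumption: the repository provides remote code compatible with the
    Auto classes. trust_remote_code=True executes code from the model
    repository, so review it before running.
    """
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        repo_id, trust_remote_code=True
    )
    return tokenizer, model
```

Calling `load_hyenadna(hf_checkpoints(CHECKPOINTS)[0])` would fetch the smallest of the three `-hf` checkpoints; the classification head is only a starting point for fine-tuning on a downstream task.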
Taxonomy
Topics: RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Absolute Position Encodings
