DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, Han Liu

TL;DR
DNABERT-2 introduces an efficient genome foundation model using Byte Pair Encoding for tokenization, significantly reducing computational costs while maintaining high performance, and establishes a comprehensive benchmark for multi-species genome understanding.
Contribution
The paper proposes replacing k-mer tokenization with BPE in genome models, improving efficiency and performance, and introduces the GUE benchmark for standardized genome understanding evaluation.
Findings
DNABERT-2 achieves performance comparable to the state-of-the-art model with 21x fewer parameters.
Pre-training requires approximately 92x less GPU time than that model.
BPE tokenization overcomes limitations of k-mer methods.
Abstract
Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundational models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely hinged on k-mer, fixed-length permutations of A, T, C, and G, as the token of the genome language due to its simplicity. However, we argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequent co-occurring genome segment in the corpus. We demonstrate that BPE not only overcomes the limitations of…
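The two tokenization schemes contrasted in the abstract can be illustrated with a minimal sketch. This is a toy implementation for intuition only, not the tokenizer shipped with DNABERT-2: the function names (`kmer_tokenize`, `learn_bpe_merges`, `apply_merge`, `bpe_tokenize`), the tie-breaking rule, and the tiny example corpus are all assumptions made for this illustration.

```python
from collections import Counter


def kmer_tokenize(seq, k, stride=1):
    """Split a DNA string into fixed-length k-mers.

    stride=1 yields overlapping k-mers; stride=k yields non-overlapping ones.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def apply_merge(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of DNA strings.

    Starts from single nucleotides and repeatedly merges the most frequent
    adjacent token pair. Ties are broken lexicographically (an assumption
    made here so the result is deterministic).
    """
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for toks in seqs:
            pair_counts.update(zip(toks, toks[1:]))
        if not pair_counts:
            break  # every sequence has collapsed to a single token
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        seqs = [apply_merge(toks, best) for toks in seqs]
    return merges


def bpe_tokenize(seq, merges):
    """Tokenize a DNA string by applying learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Overlapping 6-mers: an 8-base sequence yields 8 - 6 + 1 = 3 tokens.
    print(kmer_tokenize("ATCGATCG", 6))

    # Toy corpus (hypothetical): learn two merges, then tokenize a new string.
    merges = learn_bpe_merges(["ATATATAT"], num_merges=2)
    print(merges)
    print(bpe_tokenize("ATATAT", merges))
```

With overlapping k-mers the token count stays close to the sequence length, whereas BPE tokens have variable length and the sequence shortens as frequent segments are merged. That difference is one way to read the efficiency argument in the abstract.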
Code & Models
- 🤗 zhihan1996/DNABERT-2-117M (model, 78k downloads, 93 likes)
- 🤗 liminghong/DNABERT-2-117M (model, 5 downloads)
- 🤗 jaandoui/DNABERT2-AttentionExtracted (model, 5 downloads, 4 likes)
- 🤗 czl/dnabert2 (model, 184 downloads)
- 🤗 vivym/DNABERT-2-117M (model, 2 downloads)
- 🤗 metagene-ai/METAGENE-1 (model, 74 downloads, 26 likes)
- 🤗 ashalaa/CS224N_DNABERT2 (model, 3 downloads)
- 🤗 quietflamingo/dnabert2-no-flashattention (model, 1.3k downloads, 1 like)
- 🤗 yangheng/DNABERT-2-117M (model, 53 downloads)
- 🤗 gustoudu81/DNABERT-2-117M-tritonfix (model, 46 downloads, 1 like)
Taxonomy
Topics: Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Label Smoothing · Dense Connections · Adam · Residual Connection · Softmax
