DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

Zhihan Zhou; Yanrong Ji; Weijian Li; Pratik Dutta; Ramana Davuluri; Han Liu

arXiv:2306.15006 · q-bio.GN · March 20, 2024 · 158 citations

Open Access · 5 Repos · 10 Models · 1 Dataset

TL;DR

DNABERT-2 introduces an efficient genome foundation model using Byte Pair Encoding for tokenization, significantly reducing computational costs while maintaining high performance, and establishes a comprehensive benchmark for multi-species genome understanding.

Contribution

The paper proposes replacing k-mer tokenization with BPE in genome models, improving efficiency and performance, and introduces the GUE benchmark for standardized genome understanding evaluation.
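
For readers unfamiliar with the baseline, the following is a minimal Python sketch of k-mer tokenization, the scheme the paper proposes to replace. The names `kmer_tokenize`, `stride`, and `kmer_vocab_size`, along with the example sequence, are illustrative assumptions and do not come from the paper or its code. A stride of 1 yields overlapping k-mers and a stride of k yields non-overlapping ones.

```python
def kmer_tokenize(seq, k, stride=1):
    """Split a DNA string into fixed-length k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped (a
    simplification made for this sketch).
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_vocab_size(k):
    """Number of distinct k-mers over the alphabet {A, T, C, G}."""
    return 4 ** k


if __name__ == "__main__":
    seq = "ATCGGA"  # hypothetical example sequence
    print(kmer_tokenize(seq, 3))            # overlapping
    print(kmer_tokenize(seq, 3, stride=3))  # non-overlapping
    print(kmer_vocab_size(6))
```

With stride 1, a sequence of n bases produces n - k + 1 tokens, so the token sequence is almost as long as the raw sequence. The vocabulary grows as 4^k regardless of how often each k-mer actually occurs in the corpus.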

Findings

01. DNABERT-2 achieves performance comparable to the state-of-the-art model with 21x fewer parameters.

02. Pre-training requires approximately 92x less GPU time than that model.

03. BPE tokenization overcomes limitations of k-mer methods.

Abstract

Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundational models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely hinged on k-mer, fixed-length permutations of A, T, C, and G, as the token of the genome language due to its simplicity. However, we argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequent co-occurring genome segment in the corpus. We demonstrate that BPE not only overcomes the limitations of…
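
The merge procedure described in the abstract can be sketched in a few lines of Python. This is a toy illustration of generic BPE applied to nucleotide strings, not the authors' tokenizer. The function names, the tie-breaking rule, and the two-sequence corpus are assumptions made for the example.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of DNA strings."""
    seqs = [list(s) for s in corpus]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in seqs:
            pairs.update(zip(toks, toks[1:]))  # count adjacent token pairs
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (an assumption
        # made here only to keep the sketch deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        seqs = [merge_pair(toks, best) for toks in seqs]
    return merges


def bpe_tokenize(seq, merges):
    """Apply learned merge rules, in training order, to a new sequence."""
    toks = list(seq)
    for pair in merges:
        toks = merge_pair(toks, pair)
    return toks


if __name__ == "__main__":
    corpus = ["ATATATCG", "ATATGG"]  # hypothetical toy corpus
    rules = train_bpe(corpus, num_merges=2)
    print(rules)
    print(bpe_tokenize("ATATATCG", rules))
```

Unlike k-mers, the resulting tokens have variable length and do not overlap, so frequent segments collapse into single tokens and the tokenized sequence is shorter than the raw one.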

Code & Models

Datasets

leannmlindsey/GUE · dataset · 333 downloads

Taxonomy

Topics: Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Label Smoothing · Dense Connections · Adam · Residual Connection · Softmax