Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Lifeng Qiao; Peng Ye; Yuchen Ren; Weiqiang Bai; Chaoqi Liang; Xinzhu Ma; Nanqing Dong; Wanli Ouyang

arXiv:2412.13716 · q-bio.GN · December 19, 2024


TL;DR

MxDNA introduces an adaptive, model-learned DNA tokenization framework that outperforms traditional methods, capturing genomic features more effectively with less data and time, and providing new insights into DNA sequence modeling.

Contribution

The paper presents MxDNA, a novel framework where the model autonomously learns DNA tokenization strategies using gradient descent, tailored for genomic sequences' unique properties.

Findings

01

MxDNA achieves superior performance on benchmark datasets.

02

It requires less pretraining data and time than existing methods.

03

MxDNA learns a distinct tokenization strategy capturing genomic functionalities.

Abstract

Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt the tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenize DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent. MxDNA employs a sparse Mixture of Convolution Experts coupled with a deformable convolution to model the tokenization process, with the discontinuous, overlapping, and ambiguous nature of meaningful genomic segments explicitly considered. On Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA…
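To make the "sparse Mixture of Convolution Experts" idea in the abstract more concrete, here is a minimal NumPy sketch of top-1 routing over convolution experts with different kernel sizes. It is an illustration under stated assumptions, not the authors' implementation: the function names (`one_hot`, `conv1d_same`, `sparse_conv_moe`), the kernel sizes, the random weights, and the top-1 gating rule are all hypothetical choices, and the deformable convolution stage mentioned in the abstract is omitted entirely.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    idx = np.array([ALPHABET.index(c) for c in seq])
    return np.eye(len(ALPHABET))[idx]

def conv1d_same(x, w):
    """1D convolution with 'same' zero padding.

    x: (L, C_in); w: (k, C_in, C_out) with odd k. Returns (L, C_out).
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # One length-k window per position: (L, k, C_in)
    windows = np.stack([xp[i:i + k] for i in range(x.shape[0])])
    return np.einsum("lkc,kco->lo", windows, w)

def sparse_conv_moe(x, router_w, expert_ws):
    """Top-1 routing: each position is handled by exactly one conv expert.

    Assumption: experts differ only in kernel size, so the chosen expert
    loosely plays the role of a candidate "token length" at that position.
    """
    logits = x @ router_w                          # (L, n_experts)
    logits -= logits.max(axis=1, keepdims=True)    # stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    choice = probs.argmax(axis=1)                  # (L,)

    out = np.zeros((x.shape[0], expert_ws[0].shape[2]))
    for e, w in enumerate(expert_ws):
        mask = choice == e
        if mask.any():
            # Computed densely for clarity; a real sparse layer would only
            # evaluate the positions routed to this expert.
            y = conv1d_same(x, w)
            # Scaling by the gate probability is the usual way to keep the
            # router trainable by gradient descent in an autodiff framework.
            out[mask] = probs[mask, e:e + 1] * y[mask]
    return out, choice

rng = np.random.default_rng(0)
kernel_sizes = (1, 3, 5)   # hypothetical expert kernel sizes
d_model = 8                # hypothetical output width
seq = "ACGTTGCAACGT"
x = one_hot(seq)
router_w = rng.normal(size=(4, len(kernel_sizes)))
expert_ws = [rng.normal(size=(k, 4, d_model)) for k in kernel_sizes]

out, choice = sparse_conv_moe(x, router_w, expert_ws)
print(out.shape)   # (12, 8)
print(choice)      # expert index selected at each position
```

Because the router in this sketch sees only one-hot bases, every occurrence of the same nucleotide selects the same expert. A trained model would presumably route on contextual embeddings instead, which is what would let the chosen segment length vary with the surrounding sequence.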

Code & Models

Repositories

qiaoqiaolf/mxdna (official)

Taxonomy

Topics: Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms

Methods: Linear Layer · ADaptive gradient method with the OPTimal convergence rate · Dropout · Convolution · Multi-Head Attention · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection