DNAMotifTokenizer: Towards Biologically Informed Tokenization of Genomic Sequences
Xiaoxiao Zhou, Zihan Wang, Jingbo Shang, Yang E. Li

TL;DR
This paper introduces DNAMotifTokenizer, a biologically informed tokenization method for DNA sequences that improves genomic language models by incorporating domain knowledge of DNA motifs, leading to better performance and interpretability.
Contribution
We propose DNAMotifTokenizer, a novel tokenization approach that integrates DNA motif knowledge, outperforming traditional methods like BPE in genomic tasks.
Findings
BPE performs well when trained on small but biologically relevant data
Tokenizer choice affects task-specific performance
Knowledge-infused tokenization enhances model interpretability
Abstract
DNA language models have advanced genomics, but their downstream performance varies widely due to differences in tokenization, pretraining data, and architecture. We argue that a major bottleneck lies in tokenizing sparse and unevenly distributed DNA sequence motifs, which are critical for accurate and interpretable models. To investigate, we systematically benchmark k-mer and Byte-Pair Encoding (BPE) tokenizers under a controlled pretraining budget, evaluating on multiple downstream tasks from five datasets. We find that tokenizer choice induces task-specific trade-offs, and that vocabulary size and tokenizer training data strongly influence the biological knowledge captured. Notably, BPE tokenizers achieve strong performance when trained on smaller but biologically significant data. Building on these insights, we introduce DNAMotifTokenizer, which directly incorporates domain…
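As a point of reference for the k-mer baseline mentioned in the abstract, here is a minimal sketch of fixed-length k-mer tokenization. The function name `kmer_tokenize` and its `stride` parameter are hypothetical and chosen for illustration; this is not the paper's implementation, and it assumes any trailing fragment shorter than `k` is simply dropped.

```python
def kmer_tokenize(seq, k=3, stride=None):
    """Split a DNA string into k-mers (illustrative sketch, not the paper's code).

    stride == k gives non-overlapping k-mers; stride == 1 gives fully
    overlapping k-mers. A trailing fragment shorter than k is dropped
    (an assumption made here for simplicity).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    stride = k if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")
    seq = seq.upper()
    # Start a window every `stride` bases, keeping only full-length windows.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


non_overlapping = kmer_tokenize("ACGTAC", k=3)            # ['ACG', 'TAC']
overlapping = kmer_tokenize("ACGTAC", k=3, stride=1)      # ['ACG', 'CGT', 'GTA', 'TAC']
```

Unlike BPE, which learns variable-length units from a training corpus, this scheme has a fixed vocabulary of at most 4^k tokens over the alphabet A, C, G, T, so token boundaries depend only on position and not on sequence content.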
Peer Reviews
Decision · Submitted to ICLR 2026
- The core contribution is the hard-coding of biological prior knowledge (transcription factor motifs, TF motifs) into the vocabulary as "tokens," demonstrating performance gains across multiple benchmarks through experiments.
- Focuses on the core issue of DNA language models (the impact of tokenization strategies) and conducts systematic and reproducible comparative experiments.
- The introduction of biological priors (motifs, cCREs) enhances interpretability, showing consistent gains across mul
- The tokenization relies entirely on external databases (JASPAR, ENCODE), making the approach essentially "manual knowledge injection," which cannot adapt to unknown regions or new species.
- Limited Innovation: Using motifs as a vocabulary is an engineering improvement that is insufficient for a theoretical breakthrough. There is inadequate biological interpretative analysis, as the paper does not quantify the impact of motif tokens on the model's internal representations.
- About Generaliza
1. The proposed tokenizer is conceptually simple and biologically informed, with clear pseudocode that enhances reproducibility.
2. The experimental design of this study is rigorous as it meticulously isolates the impact of tokenization by matching computational FLOPs, model architecture, and fine-tuning pipelines across all comparisons (GUE, SCREEN, DART-Eval, Genomic Benchmarks, NT Benchmarks).
1. The technical presentation of this work needs further work. The paper presents both a benchmark and a new method in a 9-page paper. As a direct result, the benchmark is not comprehensive and the analysis of the method is insufficient.
2. The experimental results show small absolute gains, and the variance is not reported, e.g. some improvements are ≤ 0.0005 in absolute terms.
3. The 0-2 bp offset and random tie-breaking are reasonable, but their stability and computational complex
1. Brings explicit biological priors (TF motifs, cRE) into the tokenization step, offering a more interpretable alternative to opaque subword units.
2. Provides multiple ablations on vocabulary size, segmentation strategies, and qualitative motif coverage
1. The paper’s own results indicate that k-mer is consistently better than BPE, and DNAMotifTokenizer’s average on NT-benchmarks remains notably below k-mer. This undercuts the central narrative that knowledge-injected tokenization improves fundamental understanding.
2. Converting PWMs via a fixed 0.5 threshold and trimming wildcard ends discards degenerate bases and positional uncertainty, which are biologically meaningful. This can fragment genuine motif families and reduce robustness to natural vari
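
The PWM conversion questioned in the last review can be illustrated with a small sketch. The details here are assumptions, since the excerpt does not spell out the procedure: each PWM column is taken to hold probabilities in A, C, G, T order, a base is kept only if its probability strictly exceeds the threshold, every other position becomes the wildcard `N`, and leading and trailing `N`s are trimmed. The name `pwm_to_consensus` and the example matrix are hypothetical.

```python
BASES = "ACGT"


def pwm_to_consensus(pwm, threshold=0.5):
    """Collapse a position weight matrix into a consensus string (illustrative).

    `pwm` is a list of columns, each a list of four probabilities in
    A, C, G, T order. Assumption: a base is emitted only when its
    probability is strictly greater than `threshold`; otherwise the
    position becomes the wildcard 'N'. Wildcards at both ends are trimmed.
    """
    consensus = []
    for column in pwm:
        # Index of the most probable base in this column.
        best_index = max(range(len(BASES)), key=lambda i: column[i])
        if column[best_index] > threshold:
            consensus.append(BASES[best_index])
        else:
            consensus.append("N")
    # Trim wildcard positions from both ends; interior wildcards remain.
    return "".join(consensus).strip("N")


# Hypothetical 6-position PWM, made up for this example.
example_pwm = [
    [0.25, 0.25, 0.25, 0.25],  # uninformative -> N (trimmed)
    [0.90, 0.05, 0.03, 0.02],  # A
    [0.10, 0.10, 0.70, 0.10],  # G
    [0.40, 0.40, 0.10, 0.10],  # A or C, neither above 0.5 -> N
    [0.05, 0.05, 0.10, 0.80],  # T
    [0.30, 0.30, 0.20, 0.20],  # weak preference -> N (trimmed)
]
consensus = pwm_to_consensus(example_pwm)  # 'AGNT'
```

In this made-up example the fourth position prefers A or C equally (IUPAC code M), yet the hard threshold reduces it to `N`, which is the kind of lost degeneracy the reviewer describes.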
Taxonomy
Topics: Genomics and Chromatin Dynamics · Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics
