Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling
Chenlei Gong, Yuanhe Tian, Lei Mao, Yan Song

TL;DR
This study systematically compares different tokenization and positional encoding methods for Transformer models in DNA sequence analysis, providing practical insights for optimizing model design and performance.
Contribution
The paper offers a comprehensive evaluation of segmentation and positional encoding techniques, highlighting BPE's advantages and the effectiveness of RoPE and ALiBi in DNA modeling.
Findings
BPE yields more stable and higher performance across tasks.
RoPE effectively captures periodic motifs and long-range dependencies.
Increasing depth from 3 to 12 layers significantly improves performance, with diminishing returns beyond 12 layers.
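The RoPE finding rests on one property: after rotation, the query-key dot product depends only on the relative offset between two positions, which is one plausible reason it suits periodic motifs. The following is a minimal sketch of that property in plain Python, not the authors' implementation. The function names `rope` and `dot`, the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for illustration; real implementations differ in how they pair dimensions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... of vec by pos * theta_j.

    Assumption: interleaved pairing and base 10000, as in common RoPE
    implementations; the paper's exact variant is not stated here.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # i = 2j, so base ** (-i / d) equals base ** (-2j / d)
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 apart, so the attention scores should match.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 103), rope(k, 100))
```

Here `near` and `far` agree up to floating-point error, because only the offset of 3 enters the score, not the absolute positions.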
Abstract
Currently, many studies view DNA sequences as a special type of language and utilize Transformers to model them. These studies use fixed-length k-mer segmentation and BPE subword tokenization but lack a systematic evaluation to determine which is superior. We compare k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encoding methods: sinusoidal, ALiBi, and RoPE. Each configuration is trained from scratch in 3-, 6-, 12-, and 24-layer Transformer encoders and evaluated on the GUE benchmark dataset. In general, BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens, reducing sequence length, and improving model generalization. RoPE excels at capturing periodic motifs and extrapolating to long sequences, while ALiBi also performs well on tasks driven by local dependencies. In terms of depth, we…
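To make the fixed-length side of the comparison concrete, here is a minimal sketch of k-mer segmentation. The function name `kmer_tokenize` and the `stride` parameter are hypothetical; the abstract does not say whether the k-mers overlap, so both modes are shown, and dropping a trailing remainder shorter than k is also an assumption.

```python
from typing import List, Optional

def kmer_tokenize(seq: str, k: int, stride: Optional[int] = None) -> List[str]:
    """Split a DNA string into fixed-length k-mers.

    stride=None (default) gives non-overlapping k-mers (stride = k);
    stride=1 gives fully overlapping k-mers. A trailing remainder shorter
    than k is dropped (an assumption made for this sketch).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    step = k if stride is None else stride
    if step < 1:
        raise ValueError("stride must be at least 1")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

seq = "ACGTACGTAC"
non_overlapping = kmer_tokenize(seq, 3)        # ['ACG', 'TAC', 'GTA']
overlapping = kmer_tokenize(seq, 3, stride=1)  # 8 tokens, one per start position

# Over the alphabet {A, C, G, T} there are 4**k possible k-mers.
vocab_sizes = {k: 4 ** k for k in (1, 3, 4, 5, 6)}
```

For k=6 the vocabulary has 4**6 = 4096 entries, the same size as the 4,096-token BPE vocabulary in the comparison. Unlike these fixed-length tokens, BPE tokens vary in length, which is how frequent motifs end up compressed into single tokens.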
