EvoLen: Evolution-Guided Tokenization for DNA Language Model
Nan Huang, Xiaoxiao Zhou, Junxia Cui, Mario Tapia-Pacheco, Tiffany Amariuta, Yang Li, Jingbo Shang

TL;DR
EvoLen is an evolution-guided tokenization method for DNA language models that better preserves functional motifs by integrating evolutionary signals into the tokenization process.
Contribution
It proposes a novel tokenization approach that incorporates evolutionary information, improving biological relevance and interpretability of DNA sequence representations.
Findings
EvoLen better preserves functional sequence motifs.
It improves differentiation across genomic contexts.
It matches or outperforms standard BPE on benchmarks.
Abstract
Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA lacks inherent token boundaries or predefined compositional rules, making tokenization a fundamental modeling decision rather than a naturally specified one. While existing approaches like byte-pair encoding (BPE) excel at capturing token structures that reflect human-generated linguistic regularities, DNA is organized by biological function and evolutionary constraint rather than linguistic convention. We argue that DNA tokenization should prioritize functional sequence patterns like regulatory motifs: short, recurring segments under evolutionary constraint and typically preserved across species. We incorporate evolutionary information directly into the tokenization process through EvoLen, a tokenizer that combines evolutionary…
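For readers unfamiliar with the baseline the abstract contrasts against, the following is a minimal sketch of plain frequency-based BPE applied to DNA strings. It is not EvoLen and uses no evolutionary signal; the function names (`train_bpe`, `apply_merge`, `tokenize`) and the toy input are assumptions made for illustration only.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; break ties lexicographically.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    merges = train_bpe(["TATATA"], num_merges=2)
    print(merges)
    print(tokenize("TATATA", merges))
```

Because the merge criterion here is raw pair frequency, nothing in this procedure knows whether a token aligns with a conserved motif, which is the gap the abstract says evolution-guided tokenization is meant to address.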
