MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

Siyuan Li; Kai Yu; Anna Wang; Zicheng Liu; Chang Yu; Jingbo Zhou; Qirong Yang; Yucheng Guo; Xiaoming Zhang; Stan Z. Li

arXiv:2511.14806·q-bio.GN·November 20, 2025

TL;DR

MergeDNA introduces a hierarchical, context-aware genome modeling approach that dynamically adapts tokenization to genomic sequence complexity, improving performance on DNA benchmarks and multi-omics tasks.

Contribution

The paper proposes a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks, so that tokenization can adapt to the varying complexity of genomic sequences.

Findings

1. Outperforms existing tokenization methods on DNA benchmarks
2. Achieves superior results in multi-omics tasks
3. Effective in both fine-tuning and zero-shot settings

Abstract

Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying on either four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. As for network structures, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of the differentiable token merging blocks with local-window constraints, then a Latent Encoder captures the global context of these merged words by full-attention blocks.…
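
To make the idea of chunking adjacent bases into words more concrete, here is a minimal toy sketch of local-window token merging. It is not the authors' implementation: the abstract describes differentiable merging blocks, whereas this sketch uses a hard, non-differentiable rule (merge the single most similar adjacent pair in each window), one-hot base embeddings, and hypothetical names (`merge_layer`, `chunk`, `ONE_HOT`) chosen purely for illustration.

```python
import math

# Assumption: one-hot base embeddings; a real model would use learned embeddings.
ONE_HOT = {"A": [1.0, 0.0, 0.0, 0.0], "C": [0.0, 1.0, 0.0, 0.0],
           "G": [0.0, 0.0, 1.0, 0.0], "T": [0.0, 0.0, 0.0, 1.0]}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def merge_layer(tokens, sizes, window=4):
    """One toy merging layer (hypothetical, hard merging).

    The sequence is split into non-overlapping windows. Inside each window,
    the most similar pair of adjacent tokens is replaced by their
    size-weighted average, so every window with >= 2 tokens loses one token.
    `sizes[i]` counts how many original bases token i covers.
    """
    out_tokens, out_sizes = [], []
    for start in range(0, len(tokens), window):
        t = tokens[start:start + window]
        s = sizes[start:start + window]
        if len(t) < 2:  # nothing to merge in a single-token window
            out_tokens += t
            out_sizes += s
            continue
        sims = [cosine(t[j], t[j + 1]) for j in range(len(t) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)  # first best pair
        total = s[i] + s[i + 1]
        merged = [(a * s[i] + b * s[i + 1]) / total
                  for a, b in zip(t[i], t[i + 1])]
        out_tokens += t[:i] + [merged] + t[i + 2:]
        out_sizes += s[:i] + [total] + s[i + 2:]
    return out_tokens, out_sizes


def chunk(seq, sizes):
    """Recover the variable-length 'words' implied by the merge sizes."""
    words, pos = [], 0
    for n in sizes:
        words.append(seq[pos:pos + n])
        pos += n
    return words


seq = "AACGTTGC"
tokens0 = [ONE_HOT[b] for b in seq]
sizes0 = [1] * len(seq)

# Stacking two layers yields progressively longer, variable-length words.
tokens1, sizes1 = merge_layer(tokens0, sizes0)
tokens2, sizes2 = merge_layer(tokens1, sizes1)
words = chunk(seq, sizes2)
print(sizes1, sizes2, words)
```

Because merging only ever combines neighbours inside a window, each resulting token still corresponds to a contiguous span of bases, which is what lets `chunk` read the words back off the original sequence. In the architecture described above, the merged tokens would then be passed to the Latent Encoder; that stage is not sketched here.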

Taxonomy

Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Genomics and Chromatin Dynamics