MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li

TL;DR
MergeDNA introduces a hierarchical, context-aware genome modeling approach that dynamically adapts tokenization to genomic sequence complexity, improving performance on DNA benchmarks and multi-omics tasks.
Contribution
It proposes a novel hierarchical architecture with dynamic tokenization and context-aware pre-training, addressing variability in genomic sequence complexity.
Findings
Outperforms existing tokenization methods on DNA benchmarks
Achieves superior results in multi-omics tasks
Effective in both fine-tuning and zero-shot settings
Abstract
Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying on either four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. As for network structures, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of the differentiable token merging blocks with local-window constraints, then a Latent Encoder captures the global context of these merged words by full-attention blocks.…
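The idea of chunking adjacent bases into words by stacking merging layers under a local-window constraint can be illustrated with a small sketch. This is not the paper's differentiable token merging block. It is a simplified, non-learned stand-in that assumes one-hot base embeddings, cosine similarity as the merge criterion, exactly one merge per non-overlapping window, and length-weighted averaging of merged embeddings. The names `merge_once`, `one_hot`, and `words_from_spans` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def one_hot(base):
    """Hypothetical embedding: one-hot vector over A, C, G, T."""
    return [1.0 if base == b else 0.0 for b in "ACGT"]

def merge_once(tokens, spans, window=4):
    """One illustrative merging layer.

    Within each non-overlapping window of `window` tokens, merge the most
    similar adjacent pair into a single token (ties go to the leftmost pair).
    tokens: list of embedding vectors; spans: (start, end) base range per token.
    """
    out_tokens, out_spans = [], []
    for lo in range(0, len(tokens), window):
        hi = min(lo + window, len(tokens))
        if hi - lo < 2:  # a lone token has nothing to merge with
            out_tokens.extend(tokens[lo:hi])
            out_spans.extend(spans[lo:hi])
            continue
        best = max(range(lo, hi - 1),
                   key=lambda i: cosine(tokens[i], tokens[i + 1]))
        i = lo
        while i < hi:
            if i == best:
                na = spans[i][1] - spans[i][0]          # bases in left token
                nb = spans[i + 1][1] - spans[i + 1][0]  # bases in right token
                merged = [(na * a + nb * b) / (na + nb)
                          for a, b in zip(tokens[i], tokens[i + 1])]
                out_tokens.append(merged)
                out_spans.append((spans[i][0], spans[i + 1][1]))
                i += 2
            else:
                out_tokens.append(tokens[i])
                out_spans.append(spans[i])
                i += 1
    return out_tokens, out_spans

def words_from_spans(seq, spans):
    """Recover the base 'words' that each merged token covers."""
    return [seq[s:e] for s, e in spans]

seq = "AAGTCCGA"
tokens = [one_hot(b) for b in seq]
spans = [(i, i + 1) for i in range(len(seq))]

# Layer 1: 8 single-base tokens become 6 tokens.
tokens1, spans1 = merge_once(tokens, spans, window=4)
words1 = words_from_spans(seq, spans1)  # ['AA', 'G', 'T', 'CC', 'G', 'A']

# Layer 2: stacking another layer produces longer words.
tokens2, spans2 = merge_once(tokens1, spans1, window=4)
words2 = words_from_spans(seq, spans2)
```

Because merges only happen between neighbours inside a window, every token always covers a contiguous run of bases, so the merged words concatenate back to the original sequence. In this toy version, repeated bases merge first since their embeddings are identical; a learned model would instead derive the merge decisions from trained representations.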
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Genomics and Chromatin Dynamics
