DNACHUNKER: Learnable Tokenization for DNA Language Models
Taewon Kim, Jihwan Shin, Hyomin Kim, Youngmok Jung, Jonghoon Lee, Won-Chul Lee, Sungsoo Ahn, Insu Han

TL;DR
DNAChunker is a learnable tokenization model for DNA sequences that adaptively segments genomic data, improving representation and robustness over fixed methods across multiple benchmarks.
Contribution
It introduces a dynamic, learnable segmentation module for DNA language models, enabling context-dependent, variable-length units that enhance performance and biological relevance.
Findings
DNAChunker outperforms fixed-tokenization baselines on five benchmarks.
Segmentation learned by DNAChunker is mutation-resilient and biologically informed.
The model allocates finer granularity to functionally enriched regions.
Abstract
DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNAChunker, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNAChunker learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pretrain DNAChunker on the human reference genome and evaluate it across five benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that unlike fixed…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Chromatin Dynamics · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
