Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
Ganesh Sapkota, Md Hasibur Rahman

TL;DR
This paper introduces a hybrid tokenization approach combining 6-mer and BPE methods to improve DNA language models, capturing both local and global sequence features for better prediction accuracy.
Contribution
The study proposes a novel hybrid tokenization strategy that merges k-mer and BPE tokens, enhancing DNA language model performance over existing methods.
Findings
Achieved higher next-k-mer prediction accuracy than state-of-the-art models.
Demonstrated improved capture of local and global DNA sequence features.
Validated the effectiveness of hybrid tokenization in genomic language modeling.
Abstract
This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges, including uneven token distribution and a limited understanding of global sequence context. To address these limitations, we propose merging unique 6-mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction…
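As a rough illustration of the vocabulary merge described in the abstract, the following Python sketch builds a set of 6-mer tokens, learns BPE tokens over a nucleotide corpus, and takes their union. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`all_kmers`, `learn_bpe_tokens`, `build_hybrid_vocab`) are hypothetical, the 6-mer set is taken to be all 4^6 strings over ACGT rather than only those observed in a corpus, and the abstract's "optimally selected" BPE tokens are approximated by simply keeping every merge in the order it is learned.

```python
from collections import Counter
from itertools import product


def all_kmers(k=6, alphabet="ACGT"):
    """Enumerate every k-mer over the alphabet (4**k strings for DNA)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def learn_bpe_tokens(sequences, num_merges):
    """Toy character-level BPE: return the tokens created by each merge.

    Assumption: plain greedy BPE with no pre-tokenization; ties between
    equally frequent pairs are broken by first occurrence.
    """
    corpus = [list(seq) for seq in sequences]
    learned = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break  # nothing left to merge
        (left, right), _count = pairs.most_common(1)[0]
        merged = left + right
        learned.append(merged)
        # Rewrite the corpus, replacing each occurrence of the chosen pair.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return learned


def build_hybrid_vocab(sequences, k=6, num_merges=600):
    """Union of k-mer tokens and BPE tokens, mapped to integer ids."""
    vocab = {}
    for token in all_kmers(k) + learn_bpe_tokens(sequences, num_merges):
        vocab.setdefault(token, len(vocab))  # skip duplicates, keep first id
    return vocab


if __name__ == "__main__":
    toy_corpus = ["ACGTACGTACGT", "ACGTTTACGT"]  # illustrative data only
    vocab = build_hybrid_vocab(toy_corpus, num_merges=5)
    print(len(vocab))
```

On a real corpus, `num_merges=600` would correspond to the 600 BPE cycles mentioned in the abstract. BPE tokens of length 6 coincide with existing 6-mers and are deduplicated by the union, so the final vocabulary can be smaller than 4096 + 600.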
Taxonomy
Topics: DNA and Biological Computing
