Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

Ganesh Sapkota; Md Hasibur Rahman

arXiv:2507.18570·cs.CL·July 25, 2025

Open Access

TL;DR

This paper introduces a hybrid tokenization approach combining 6-mer and BPE methods to improve DNA language models, capturing both local and global sequence features for better prediction accuracy.

Contribution

The study proposes a novel hybrid tokenization strategy that merges k-mer and BPE tokens, enhancing DNA language model performance over existing methods.

Findings

01. Achieved higher next-k-mer prediction accuracy than state-of-the-art models.

02. Demonstrated improved capture of local and global DNA sequence features.

03. Validated the effectiveness of hybrid tokenization in genomic language modeling.

Abstract

This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization captures local DNA sequence structure well but suffers from uneven token distribution and a limited view of global sequence context. To address these limitations, we propose merging unique 6-mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach yields a balanced, context-aware vocabulary that lets the model capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction…
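The vocabulary construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmer_vocab`, `train_bpe`, `hybrid_vocab`) are hypothetical, the 6-mer set is assumed to be all 4^6 possible 6-mers rather than only those observed in a corpus, the BPE trainer is a toy version starting from single nucleotides, and the abstract's "optimally selected" step is omitted because the selection criterion is not given here, so every merged token is kept.

```python
from collections import Counter
from itertools import product


def kmer_vocab(k=6, alphabet="ACGT"):
    """Return every possible k-mer over the alphabet (4**k tokens for DNA)."""
    return {"".join(p) for p in product(alphabet, repeat=k)}


def train_bpe(sequences, num_merges):
    """Toy BPE trainer (assumption: starts from single nucleotides).

    Repeatedly merges the most frequent adjacent token pair and
    returns the list of merged tokens in the order they were created.
    """
    corpus = [list(seq) for seq in sequences]
    merged_tokens = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break  # nothing left to merge
        (a, b), _count = pairs.most_common(1)[0]
        new_tok = a + b
        merged_tokens.append(new_tok)
        # Apply the merge everywhere, scanning left to right.
        for i, toks in enumerate(corpus):
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and toks[j] == a and toks[j + 1] == b:
                    out.append(new_tok)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            corpus[i] = out
    return merged_tokens


def hybrid_vocab(sequences, k=6, num_merges=600):
    """Union of the k-mer vocabulary and the BPE tokens (hypothetical helper)."""
    return sorted(kmer_vocab(k) | set(train_bpe(sequences, num_merges)))


if __name__ == "__main__":
    toy_corpus = ["ACGTACGTACGTTTGACA", "GGGACGTACGTAAACCC"]
    vocab = hybrid_vocab(toy_corpus, k=6, num_merges=10)
    print(len(vocab), "tokens in the hybrid vocabulary")
```

Under these assumptions the hybrid vocabulary holds at most 4096 + 600 tokens, since any BPE token that happens to be exactly six nucleotides long coincides with an existing 6-mer and is counted once.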

Taxonomy

Topics: DNA and Biological Computing