DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models
Eliatan Niktab, Hardip Patel

TL;DR
DNATok is a GPU-optimized tokenization system that replaces traditional string processing with byte lookup tables and parallel pipelines, speeding up DNA sequence tokenization enough to support high-throughput genomic modeling.
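To illustrate the byte lookup table idea, here is a minimal CPU-only sketch in pure Python. It is not the DNATok implementation: the vocabulary, the ID assignments, and the names VOCAB, UNK_ID, build_lookup_table, and encode are all assumptions made for this example. It shows only the core trick, which is that each input byte indexes directly into a 256-entry table, so no string splitting or dictionary lookup happens per token.

```python
# Hypothetical single-nucleotide vocabulary; these IDs are illustrative
# and are not taken from the paper.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
UNK_ID = 4  # assumption: unknown bytes share the ID used for "N"


def build_lookup_table(vocab, unk_id):
    """Build a 256-entry table mapping every possible byte value to a token ID."""
    table = bytearray([unk_id]) * 256
    for base, token_id in vocab.items():
        table[ord(base)] = token_id          # upper-case base
        table[ord(base.lower())] = token_id  # lower-case (soft-masked) base
    return bytes(table)


TABLE = build_lookup_table(VOCAB, UNK_ID)


def encode(seq):
    """Map a bytes sequence to token IDs with one table lookup per byte."""
    # bytes.translate replaces each byte b with TABLE[b] in a single pass.
    return list(seq.translate(TABLE))


ids = encode(b"ACGTNacgtX")
```

Because the mapping is a plain array index, the same table could in principle be copied to a GPU and applied to a whole batch in one gather operation; that step is outside the scope of this sketch.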
Contribution
It introduces a GPU-first, vocabulary-agnostic tokenization system for DNA language models that outperforms existing tokenizers on both encoding throughput and host-to-device transfer throughput.
Findings
Achieves 84-95x higher encoding throughput than Hugging Face baselines.
Reaches up to 1.9x higher host-to-device transfer throughput.
End-to-end streaming attains 1.27e8 to 1.84e8 tokens/sec, removing tokenization as a bottleneck.
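To put the streaming figure in perspective, the following back-of-the-envelope calculation converts the reported throughput range into wall-clock time. The input size of one billion tokens is an assumption chosen for illustration (it would correspond to one billion bases under single-nucleotide tokenization), and the name seconds_to_tokenize is hypothetical.

```python
def seconds_to_tokenize(num_tokens, tokens_per_sec):
    """Wall-clock seconds needed to stream num_tokens at a fixed throughput."""
    return num_tokens / tokens_per_sec


NUM_TOKENS = 1_000_000_000   # assumed input size, not a figure from the paper
LOW_THROUGHPUT = 1.27e8      # reported lower bound, tokens/sec
HIGH_THROUGHPUT = 1.84e8     # reported upper bound, tokens/sec

slowest = seconds_to_tokenize(NUM_TOKENS, LOW_THROUGHPUT)
fastest = seconds_to_tokenize(NUM_TOKENS, HIGH_THROUGHPUT)

print(f"{fastest:.1f} s to {slowest:.1f} s per billion tokens")
```

Under these assumptions, a billion tokens stream through in well under ten seconds at either end of the reported range.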
Abstract
Tokenization sits at the boundary between high-throughput genomic input and GPU compute, posing challenges in both algorithm design and system throughput. Overlapping k-mer tokenization can introduce information leakage under masked language modeling (MLM) and may degrade downstream accuracy. Single-nucleotide tokenization avoids leakage and preserves per-base fidelity, but it greatly increases sequence length for attention-based architectures. Non-overlapping k-mers and byte-pair encoding (BPE) provide compression and avoid leakage, at the cost of boundary sensitivity or reduced interpretability. Empirically, the choice of tokenization interacts strongly with model architecture and task requirements. At the system level, however, standard string tokenizers and host-bound vocabulary lookups dominate wall-clock time once inputs reach billions of bases, regardless of the tokenization…
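The trade-offs among the tokenization schemes named in the abstract can be made concrete with a short sketch. The functions below are illustrative helpers written for this example, not code from the paper, and the toy sequence and k=3 are arbitrary choices. The sketch shows how sequence length differs by scheme, why a masked overlapping k-mer can be rebuilt from its two neighbours (the leakage concern under MLM), and how shifting the input by one base changes every non-overlapping k-mer (boundary sensitivity).

```python
def single_nucleotide(seq):
    """One token per base: no leakage, but the longest token sequence."""
    return list(seq)


def overlapping_kmers(seq, k):
    """Stride-1 k-mers: adjacent tokens share k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapping_kmers(seq, k):
    """Stride-k k-mers: roughly k-fold compression; a trailing remainder is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def recover_masked(tokens, i, k):
    """Rebuild overlapping token i from its neighbours (requires k >= 2).

    The left neighbour supplies the first k-1 bases and the right
    neighbour supplies the last base, so masking token i hides nothing.
    """
    return tokens[i - 1][1:] + tokens[i + 1][k - 2]


seq = "ACGTACGTAC"
k = 3

overlap = overlapping_kmers(seq, k)
chunks = non_overlapping_kmers(seq, k)
shifted_chunks = non_overlapping_kmers(seq[1:], k)  # same DNA, offset by one base

print(len(single_nucleotide(seq)), len(overlap), len(chunks))
print(recover_masked(overlap, 3, k) == overlap[3])
print(chunks, shifted_chunks)
```

On this toy input the one-base shift yields a completely different set of non-overlapping tokens, while the masked overlapping token is recovered exactly from its neighbours.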
Taxonomy
Topics: Genomics and Phylogenetic Studies · DNA and Biological Computing · Genomics and Chromatin Dynamics
