SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance
Andrei-Valentin Tănase, Elena Pelican

TL;DR
SupraTok is a tokenization method that learns multi-word semantic units ("superword" tokens), improving tokenization efficiency across multiple languages and language model performance on downstream benchmarks.
Contribution
The paper presents SupraTok, a tokenization architecture that extends Byte-Pair Encoding with cross-boundary pattern learning, entropy-driven data curation, and multi-phase curriculum learning. It reports a 31% gain in English tokenization efficiency over OpenAI's o200k tokenizer, along with improved downstream task performance.
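To make the idea of cross-boundary merging concrete, here is a minimal toy sketch in Python. It is not SupraTok's actual algorithm: it is plain character-level BPE with one assumed switch, `cross_boundary`, that decides whether merges may span a space. The function names (`learn_merges`, `merge_pair`) and the tiny corpus are hypothetical, and the paper's data curation and curriculum stages are not modelled.

```python
from collections import Counter


def merge_pair(seq, a, b):
    """Replace every adjacent occurrence of (a, b) in seq with the symbol a+b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, num_merges, cross_boundary=False):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    cross_boundary=False: every whitespace-separated word is its own sequence,
    so no merge can span a space (classic BPE pre-tokenization).
    cross_boundary=True: every line is one sequence and the space is an
    ordinary symbol, so merges can grow into multi-word tokens.
    """
    if cross_boundary:
        seqs = [list(line) for line in corpus]
    else:
        seqs = [list(word) for line in corpus for word in line.split()]

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:  # nothing left to merge
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        seqs = [merge_pair(s, a, b) for s in seqs]
    return merges, seqs


corpus = ["of the", "of the", "of the"]  # hypothetical toy corpus
plain_merges, plain_seqs = learn_merges(corpus, 5, cross_boundary=False)
cross_merges, cross_seqs = learn_merges(corpus, 5, cross_boundary=True)
print(plain_seqs[:2])  # word-internal tokens only
print(cross_seqs[0])   # a single multi-word token
```

With word-bounded merging the phrase stays as two tokens, while the cross-boundary variant can collapse a frequent phrase into one token, which is the intuition behind the "superword" units described in the abstract.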
Findings
31% improvement in English tokenization efficiency over OpenAI's o200k tokenizer (5.91 versus 4.51 characters per token)
8.4% improvement on HellaSwag
9.5% improvement on MMLU
Abstract
Tokenization remains a fundamental yet underexplored bottleneck in natural language processing, with strategies largely static despite remarkable progress in model architectures. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation through three innovations: cross-boundary pattern learning that discovers multi-word semantic units, entropy-driven data curation that optimizes training corpus quality, and multi-phase curriculum learning for stable convergence. Our approach extends Byte-Pair Encoding by learning "superword" tokens, coherent multi-word expressions that preserve semantic unity while maximizing compression efficiency. SupraTok achieves 31% improvement in English tokenization efficiency (5.91 versus 4.51 characters per token) compared to OpenAI's o200k tokenizer and 30% improvement over Google's Gemma 3 tokenizer (256k vocabulary), while…
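As a quick check on the headline figure, the short Python sketch below shows how a characters-per-token ratio and a relative improvement can be computed. The helper names are hypothetical, and reading "efficiency" as characters per token relative to the baseline is an assumption based on the numbers quoted in the abstract.

```python
def chars_per_token(text, tokens):
    """Average number of characters covered by each token."""
    return len(text) / len(tokens)


def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Figures quoted in the abstract: SupraTok versus OpenAI's o200k tokenizer.
gain = relative_improvement(5.91, 4.51)
print(f"{gain:.1%}")  # 31.0%
```

Under that reading, 5.91 versus 4.51 characters per token corresponds to the reported 31% improvement.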
