SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance

Andrei-Valentin Tănase; Elena Pelican

arXiv:2508.11857 · cs.CL · August 26, 2025

TL;DR

SupraTok is a tokenization method that learns multi-word semantic units ("superword" tokens). It reports higher tokenization efficiency and better downstream benchmark performance across multiple languages.

Contribution

The paper presents SupraTok, a new tokenization architecture that extends Byte-Pair Encoding with cross-boundary pattern learning and curriculum strategies, achieving significant efficiency gains and improved downstream task performance.
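To make the idea of cross-boundary pattern learning concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the names `train_bpe`, `merge_pair`, and the `cross_boundary` flag are hypothetical, and the paper's curriculum strategies are not modeled. The only point it illustrates is the difference between merging strictly inside whitespace-delimited words (conventional BPE pre-tokenization) and allowing a merge to span a space.

```python
from collections import Counter


def merge_pair(seq, a, b):
    """Replace each adjacent (a, b) in seq with the merged symbol a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus, num_merges, cross_boundary=False):
    """Toy BPE trainer (illustrative assumption, not SupraTok itself).

    cross_boundary=False: one symbol sequence per word, so no merge can
    span a space. cross_boundary=True: one sequence per line, and the
    space is an ordinary symbol that merges may absorb.
    """
    if cross_boundary:
        sequences = [list(line) for line in corpus]
    else:
        # A leading space marks the start of each word.
        sequences = [list(" " + w) for line in corpus for w in line.split()]

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # every sequence is already a single symbol
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        sequences = [merge_pair(seq, a, b) for seq in sequences]
    return merges, sequences


corpus = ["in the end"] * 3
_, within = train_bpe(corpus, num_merges=20, cross_boundary=False)
_, across = train_bpe(corpus, num_merges=20, cross_boundary=True)
print(within[:3])  # three word-level tokens per sentence
print(across[0])   # one multi-word token for the whole sentence
```

With enough merges, the word-bounded run can never go below one token per word, while the cross-boundary run can fuse a frequent phrase into a single token. That is the intuition behind "superword" tokens; how SupraTok decides which multi-word units are worth keeping is described in the paper, not here.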

Findings

01  31% improvement in English tokenization efficiency (5.91 versus 4.51 characters per token, compared to OpenAI's o200k tokenizer)

02  8.4% improvement on HellaSwag

03  9.5% improvement on MMLU

Abstract

Tokenization remains a fundamental yet underexplored bottleneck in natural language processing, with strategies largely static despite remarkable progress in model architectures. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation through three innovations: cross-boundary pattern learning that discovers multi-word semantic units, entropy-driven data curation that optimizes training corpus quality, and multi-phase curriculum learning for stable convergence. Our approach extends Byte-Pair Encoding by learning "superword" tokens, coherent multi-word expressions that preserve semantic unity while maximizing compression efficiency. SupraTok achieves 31% improvement in English tokenization efficiency (5.91 versus 4.51 characters per token) compared to OpenAI's o200k tokenizer and 30% improvement over Google's Gemma 3 tokenizer (256k vocabulary), while…
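The 31% figure follows directly from the two characters-per-token values quoted in the abstract. The short Python sketch below reproduces that arithmetic; the helper names `chars_per_token` and `relative_improvement` are illustrative assumptions, not names from the paper.

```python
def chars_per_token(text, tokens):
    """Average number of characters represented by each token."""
    return len(text) / len(tokens)


def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Values quoted in the abstract: SupraTok 5.91 vs. o200k 4.51 chars/token.
gain = relative_improvement(5.91, 4.51)
print(f"{gain:.1%}")  # 31.0%
```

A higher characters-per-token value means the same text is covered by fewer tokens, so a model sees shorter sequences for the same input.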
