SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling
Dong Liu, Yanxuan Yu

TL;DR
SemToken introduces a semantic-aware tokenization method that reduces token redundancy and improves efficiency in long-context language modeling while keeping perplexity and accuracy comparable to baselines.
Contribution
It proposes a novel semantic-aware tokenization framework that dynamically adjusts token granularity based on semantic density, improving efficiency over traditional frequency-based methods.
Findings
Achieves up to 2.4x reduction in token count
Realizes up to 1.9x speedup in language modeling tasks
Maintains perplexity and accuracy comparable to baseline models
Abstract
Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can…
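The local merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_similar_spans`, the greedy left-to-right strategy, the cosine-similarity criterion, the threshold of 0.9, and the hand-made 2-D embeddings are all assumptions chosen for illustration. In the paper, embeddings come from a lightweight encoder, and the granularity is further adapted to semantic density, which this sketch does not attempt.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_similar_spans(tokens, embeddings, threshold=0.9):
    """Greedily group adjacent tokens whose embedding stays close to the
    running mean of the current span (hypothetical merging rule)."""
    if not tokens:
        return []
    spans = []
    current = [tokens[0]]
    mean = list(embeddings[0])
    for tok, emb in zip(tokens[1:], embeddings[1:]):
        if cosine(mean, emb) >= threshold:
            # Extend the span and update its mean embedding incrementally.
            n = len(current)
            mean = [(m * n + e) / (n + 1) for m, e in zip(mean, emb)]
            current.append(tok)
        else:
            # Semantic shift: close the span and start a new one.
            spans.append(current)
            current = [tok]
            mean = list(emb)
    spans.append(current)
    return spans


# Toy input: three near-identical embeddings followed by two distinct ones.
tokens = ["the", "the", "the", "cat", "sat"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [-1.0, 0.0]]
spans = merge_similar_spans(tokens, embeddings)
compression = len(tokens) / len(spans)
```

On this toy input the three redundant tokens collapse into one span while the two distinct tokens stay separate, so five tokens become three spans. A density-aware variant would vary `threshold` per region instead of fixing it globally.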
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
