SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

Dong Liu; Yanxuan Yu

arXiv:2508.15190·cs.CL·August 22, 2025

TL;DR

SemToken introduces a semantic-aware tokenization method that reduces token redundancy and enhances efficiency in long-context language modeling without sacrificing accuracy.

Contribution

The paper proposes a semantic-aware tokenization framework that dynamically adjusts token granularity based on semantic density, improving efficiency over traditional frequency-based methods.

Findings

1. Achieves up to a 2.4x reduction in token count.

2. Realizes up to a 1.9x speedup in language modeling tasks.

3. Maintains perplexity and accuracy comparable to baseline models.

Abstract

Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can…
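
The local merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_similar_spans`, the `threshold` parameter, the greedy left-to-right strategy, and the toy 2-D embeddings are all assumptions made for illustration. Real embeddings would come from a lightweight encoder, which is omitted here. The sketch merges an incoming token into the current span when its embedding is close (by cosine similarity) to the span's mean embedding, so repetitive regions collapse into coarse spans while dissimilar tokens stay fine-grained.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_similar_spans(tokens, embeddings, threshold=0.9):
    """Greedily group adjacent tokens whose embeddings stay close to the span mean.

    Hypothetical helper: `threshold` is an assumed knob, not a value from the paper.
    Returns a list of spans, each span being a list of the original tokens.
    """
    if len(tokens) != len(embeddings):
        raise ValueError("tokens and embeddings must have the same length")

    spans = []  # each entry: [list_of_tokens, running_sum_of_embeddings]
    for tok, emb in zip(tokens, embeddings):
        if spans:
            span_tokens, span_sum = spans[-1]
            centroid = [s / len(span_tokens) for s in span_sum]
            if cosine(centroid, emb) >= threshold:
                # Semantically redundant with the current span: merge it in.
                span_tokens.append(tok)
                spans[-1][1] = [s + e for s, e in zip(span_sum, emb)]
                continue
        # Dissimilar (or first) token: start a new, fine-grained span.
        spans.append([[tok], list(emb)])
    return [span_tokens for span_tokens, _ in spans]


# Toy example with made-up 2-D embeddings.
tokens = ["the", "the", "the", "cat", "sat"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [1.0, 1.0]]
print(merge_similar_spans(tokens, embeddings))
# -> [['the', 'the', 'the'], ['cat'], ['sat']]
```

In this toy input, five tokens collapse into three spans. A single fixed threshold is the simplest possible stand-in for the density-based granularity allocation the abstract mentions; a closer approximation would presumably vary the threshold with a local density estimate, which the abstract does not specify.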

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare