LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers
Yike Sun, Haotong Yang, Zhouchen Lin, Muhan Zhang

TL;DR
This paper studies intermediate merge residues in BPE tokenizers, shows that they waste vocabulary capacity and increase vulnerability to adversarial or atypical inputs, and introduces LiteToken to remove them, improving robustness and efficiency without harming performance.
Contribution
The paper provides a systematic empirical analysis of merge residues in BPE tokenizers and proposes LiteToken, a simple method to remove these residues, enhancing robustness and efficiency.
Findings
LiteToken reduces token fragmentation and vocabulary size.
Removing residues improves robustness to noisy inputs.
Pretrained models can often adopt LiteToken without fine-tuning.
Abstract
Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that are frequent during merge learning and are therefore retained in the final vocabulary, but that are mostly merged further and rarely emitted when the tokenizer is applied to the corpus. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional…
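The notion of a merge residue described in the abstract can be illustrated with a toy example. The sketch below is not the authors' LiteToken implementation; it trains a minimal character-level BPE on a hypothetical three-word corpus and lists merged tokens that are never emitted once all merges are applied. The function names (`train_bpe`, `find_residues`), the `max_count` threshold, and the corpus are assumptions made purely for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges; return (merges, final segmentation of the corpus)."""
    # Every word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = {apply_merge(symbols, best): freq for symbols, freq in corpus.items()}
    return merges, corpus


def find_residues(word_freqs, num_merges, max_count=0):
    """Return merged tokens emitted at most `max_count` times on the training corpus.

    `max_count` is a hypothetical threshold chosen for this sketch.
    """
    merges, corpus = train_bpe(word_freqs, num_merges)
    emitted = Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            emitted[symbol] += freq
    return [a + b for a, b in merges if emitted[a + b] <= max_count]


word_freqs = {"token": 10, "tokens": 6, "tokenizer": 4}
print(find_residues(word_freqs, num_merges=4))  # ['to', 'tok', 'toke']
```

In this toy corpus the intermediate tokens `to`, `tok`, and `toke` each enter the vocabulary because they were the most frequent pair at some merge step, yet every occurrence is absorbed into `token` by later merges, so they are never emitted. Real tokenizers and the paper's actual removal procedure are more involved than this sketch.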
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
