Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
Shaurya Sharthak, Vinayak Pahalwan, Adithya Kamath, Adarsh Shirawalmath

TL;DR
This paper introduces Tokenadapt, a tokenizer transplantation method, together with a pre-tokenization learning approach for multi-word Supertokens. The aim is to reduce the retraining needed after a tokenizer swap while preserving semantics and improving compression.
Contribution
The paper presents a model-agnostic tokenizer transplantation framework and a new pre-tokenization learning approach for Supertokens, enhancing compression and reducing fragmentation in language models.
Findings
Tokenadapt outperforms baseline methods like Transtokenizer and ReTok in initializing new tokens.
Supertokens achieve significant compression gains, reducing token fragmentation.
Zero-shot perplexity ratios are consistently lower with Tokenadapt across models.
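
The two quantities behind these findings can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: this summary does not define the perplexity ratio, so the sketch assumes it is the perplexity of the adapted model divided by that of the original model on the same text, and it uses tokens per whitespace-separated word as a stand-in measure of fragmentation. The function names and the `tokenize` callable are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_ratio(adapted_log_probs, original_log_probs):
    """Assumed definition: adapted-model perplexity over original-model perplexity.

    Lower is better; a value near 1.0 would mean the tokenizer swap
    cost little under this (assumed) definition.
    """
    return perplexity(adapted_log_probs) / perplexity(original_log_probs)


def tokens_per_word(text, tokenize):
    """Average number of tokens per whitespace-separated word.

    `tokenize` is any callable mapping a string to a list of tokens
    (hypothetical stand-in for a real tokenizer).
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)
```

Because the two tokenizers split the same text into different numbers of tokens, per-token perplexities are not directly comparable without care; a real evaluation would need to account for that, which this sketch does not.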
Abstract
Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges. Standard methods to overcome this often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, Tokenadapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. Tokenadapt initializes new unique token embeddings…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Balanced Selection
