Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

Shaurya Sharthak; Vinayak Pahalwan; Adithya Kamath; Adarsh Shirawalmath

arXiv:2505.09738·cs.CL·May 16, 2025

TL;DR

This paper introduces Tokenadapt, a flexible tokenizer transplantation method with novel pre-tokenization learning for Supertokens, significantly improving language model efficiency and performance by reducing retraining and preserving semantics.

Contribution

The paper presents a model-agnostic tokenizer transplantation framework and a new pre-tokenization learning approach for Supertokens, enhancing compression and reducing fragmentation in language models.

Findings

01. Tokenadapt outperforms baseline methods like Transtokenizer and ReTok in initializing new tokens.

02. Supertokens achieve significant compression gains, reducing token fragmentation.

03. Zero-shot perplexity ratios are consistently lower with Tokenadapt across models.

Abstract

Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges: standard methods to overcome it often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, Tokenadapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. Tokenadapt initializes new unique token embeddings…
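
The abstract is cut off before it explains how Tokenadapt initializes embeddings, so the sketch below does not reproduce that procedure. It only illustrates the general idea behind heuristic initialization during tokenizer replacement, using a common baseline: split each new token with the old vocabulary and average the old embeddings of the resulting pieces. The toy vocabulary, the embedding values, and the helper names `greedy_tokenize` and `init_new_embedding` are all hypothetical and chosen for illustration.

```python
from typing import Dict, List


def greedy_tokenize(text: str, vocab: Dict[str, int]) -> List[str]:
    """Longest-match-first segmentation with a toy vocabulary.

    Hypothetical stand-in for the old tokenizer; a real tokenizer
    (BPE, Unigram, ...) would segment differently.
    """
    pieces: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {text!r} at position {i}")
    return pieces


def init_new_embedding(
    new_token: str,
    old_vocab: Dict[str, int],
    old_embeddings: List[List[float]],
) -> List[float]:
    """Assumed baseline heuristic: mean of the old sub-token embeddings."""
    pieces = greedy_tokenize(new_token, old_vocab)
    dim = len(old_embeddings[0])
    total = [0.0] * dim
    for piece in pieces:
        vector = old_embeddings[old_vocab[piece]]
        for k in range(dim):
            total[k] += vector[k]
    return [value / len(pieces) for value in total]


# Toy data (illustrative values only).
old_vocab = {"token": 0, "izer": 1, "s": 2}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# "tokenizer" is new to the old vocabulary; it splits into "token" + "izer".
new_vec = init_new_embedding("tokenizer", old_vocab, old_embeddings)
print(new_vec)  # [0.5, 0.5]
```

A plain average ignores how much each piece contributes to the meaning of the new token, which is one reason heuristic initializations of this kind usually still need some fine-tuning afterwards.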

Code & Models

Repositories

Tinycompany-AI/tokenadapt
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Balanced Selection