AdaptBPE: From General Purpose to Specialized Tokenizers

Vijini Liyanage; François Yvon

arXiv:2601.21665·cs.CL·January 30, 2026

TL;DR

AdaptBPE introduces a lightweight post-training method to adapt general-purpose tokenizers for specific domains or languages, improving efficiency and performance in language modeling tasks.

Contribution

The paper proposes a post-training adaptation strategy that selectively replaces low-utility tokens in a general-purpose vocabulary with tokens that are more frequent in an adaptation corpus, tailoring the tokenizer to a target domain or language.

Findings

01. Adapted tokenizers achieve better compression efficiency than the baselines.

02. The method improves performance on generation and classification tasks across multiple languages.

03. The approach acts as a lightweight vocabulary fine-tuning mechanism.

Abstract

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora…
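The token-replacement idea described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes a greedy longest-match segmenter instead of real BPE merge rules, measures a token's utility by its raw frequency in the adaptation corpus, and proposes replacements by concatenating the most frequent adjacent token pair. The names `encode`, `adapt_vocab`, `protected`, and `n_swaps` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def encode(word, vocab):
    """Greedy longest-match segmentation (a stand-in for real BPE encoding)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Fall back to a single character when nothing longer matches.
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def adapt_vocab(vocab, corpus, protected, n_swaps):
    """Swap rarely used tokens for frequent adjacent pairs, keeping the size fixed.

    Assumption: utility is approximated by frequency in the adaptation corpus.
    """
    vocab = set(vocab)
    for _ in range(n_swaps):
        encoded = [encode(w, vocab) for w in corpus]
        token_freq = Counter(t for seq in encoded for t in seq)
        pair_freq = Counter(a + b for seq in encoded for a, b in zip(seq, seq[1:]))
        removable = [t for t in vocab if t not in protected]
        if not removable or not pair_freq:
            break
        # Least used removable token; ties are broken alphabetically.
        worst = min(removable, key=lambda t: (token_freq[t], t))
        # Most frequent candidate; ties are broken alphabetically.
        best, best_count = max(pair_freq.items(), key=lambda kv: (kv[1], kv[0]))
        if best_count <= token_freq[worst]:
            break  # No swap would help under this heuristic.
        vocab.remove(worst)
        vocab.add(best)
    return vocab


def corpus_length(corpus, vocab):
    """Total number of tokens needed to encode the corpus."""
    return sum(len(encode(w, vocab)) for w in corpus)


# Toy adaptation corpus and a "general-purpose" vocabulary (both made up).
corpus = ["token", "token", "tokens", "the"]
chars = set("".join(corpus))
base_vocab = chars | {"the", "xyz"}

adapted = adapt_vocab(base_vocab, corpus, protected=chars, n_swaps=2)
print(corpus_length(corpus, base_vocab), corpus_length(corpus, adapted))
```

The vocabulary size stays constant because every added token displaces exactly one existing token, which mirrors the fixed target vocabulary size mentioned in the abstract. Single characters are protected here so that any word remains encodable after the swaps.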

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling