AdaptBPE: From General Purpose to Specialized Tokenizers
Vijini Liyanage, François Yvon

TL;DR
AdaptBPE introduces a lightweight post-training method to adapt general-purpose tokenizers for specific domains or languages, improving efficiency and performance in language modeling tasks.
Contribution
The paper proposes a post-training adaptation strategy that selectively replaces low-utility tokens with tokens that are frequent in an adaptation corpus, tailoring the tokenizer to a target domain or language at a given vocabulary size.
Findings
Adapted tokenizers compress test corpora more efficiently than the baseline tokenizers.
The method improves performance on generation and classification tasks across multiple languages.
The approach acts as a lightweight vocabulary fine-tuning mechanism.
Abstract
Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora…
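The replacement idea described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes a greedy longest-match segmenter in place of real BPE merge rules, treats "utility" as raw usage count on the adaptation corpus, and the names `encode` and `adapt_vocab` are hypothetical. It only shows how swapping rarely used tokens for frequent adjacent pairs can shorten the encoding while the vocabulary size stays fixed.

```python
from collections import Counter


def encode(text, vocab):
    """Greedy longest-match segmentation (a stand-in for BPE, assumed here).

    Single characters are always allowed, so any text can be encoded.
    """
    max_len = max(map(len, vocab), default=1)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                out.append(piece)
                i += n
                break
    return out


def adapt_vocab(vocab, corpus, n_swaps):
    """Swap up to n_swaps rarely used tokens for frequent adjacent pairs.

    The vocabulary size is unchanged: each swap removes one multi-character
    token and adds one new token built from two adjacent tokens.
    """
    vocab = set(vocab)
    for _ in range(n_swaps):
        encoded = [encode(line, vocab) for line in corpus]
        usage = Counter(tok for toks in encoded for tok in toks)
        pairs = Counter(a + b for toks in encoded for a, b in zip(toks, toks[1:]))
        removable = [t for t in vocab if len(t) > 1]
        candidates = [(count, pair) for pair, count in pairs.items() if pair not in vocab]
        if not removable or not candidates:
            break
        # Least-used token on the adaptation corpus (ties broken alphabetically).
        worst = min(removable, key=lambda t: (usage[t], t))
        # Most frequent candidate pair (ties broken by string order).
        best_count, best = max(candidates)
        if best_count <= usage[worst]:
            break  # no swap would be beneficial under this simple criterion
        vocab.remove(worst)
        vocab.add(best)
    return vocab


# Toy usage: "xyz" and "qqq" never occur in the adaptation corpus.
base = {"to", "ken", "xyz", "qqq"}
corpus = ["tokenizer tokenizer tokenizer"]
adapted = adapt_vocab(base, corpus, n_swaps=2)
print(len(encode(corpus[0], base)), len(encode(corpus[0], adapted)))
```

Comparing the two printed token counts gives a rough compression measure for this toy setup; a real evaluation would use a held-out test corpus rather than the adaptation text itself.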
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
