Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Taido Purason, Pavel Chizhov, Ivan P. Yamshchikov, Mark Fishel

TL;DR
This paper introduces two efficient methods for adapting pre-trained tokenizers: continued BPE training, which extends the vocabulary, and leaf-based pruning, which removes redundant tokens. Together they improve tokenization efficiency and the utilization of added vocabulary.
Contribution
The paper presents two techniques for tokenizer adaptation, continued BPE training and leaf-based pruning, along with an open-source toolkit for practical vocabulary modification.
Findings
Improved tokenization efficiency across multiple languages.
Enhanced utilization of added vocabulary.
Reduced redundant tokens without sacrificing model quality.
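One way to picture the pruning finding is the toy sketch below. It rests on assumptions not stated on this page: a "leaf" is taken to be a merged token that no other merge uses as a component, and leaves are removed in order of lowest frequency. The names `find_leaves`, `prune_leaves`, and `token_freqs` are hypothetical and are not taken from the authors' toolkit.

```python
def find_leaves(merges):
    """Return merges whose resulting token is not a component of any other merge.

    Assumption: this is what "leaf" means; the page does not define the term.
    """
    components = {symbol for pair in merges for symbol in pair}
    return [pair for pair in merges if pair[0] + pair[1] not in components]


def prune_leaves(merges, token_freqs, num_to_remove):
    """Iteratively drop the least frequent leaf token (assumed selection rule).

    Removing only leaves keeps every remaining merge valid, because no
    surviving merge depends on a removed token.
    """
    merges = list(merges)
    removed = []
    for _ in range(num_to_remove):
        leaves = find_leaves(merges)
        if not leaves:
            break
        # Tie-break on the pair itself so the result is deterministic.
        victim = min(leaves, key=lambda p: (token_freqs.get(p[0] + p[1], 0), p))
        merges.remove(victim)
        removed.append(victim[0] + victim[1])
    return merges, removed


# Toy data: "lowe" is rare but is not a leaf, because "lower" is built from it.
merges = [("l", "o"), ("lo", "w"), ("low", "e"), ("lowe", "r"), ("e", "s")]
token_freqs = {"lo": 5, "low": 100, "lowe": 1, "lower": 50, "es": 2}
pruned, removed = prune_leaves(merges, token_freqs, 2)
print(removed)
print(pruned)
```

In this toy setup an intermediate token such as "lowe" only becomes removable once the token built on top of it has been pruned, which is why the loop recomputes the leaf set on each iteration.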
Abstract
Tokenizer adaptation plays an important role in adapting pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training that extends a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for…
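The idea of continuing BPE merge learning can be illustrated with a minimal character-level sketch. This is not the authors' implementation: it assumes a tokenizer represented only as an ordered list of merge pairs, ignores byte-level encoding and pre-tokenization, and uses hypothetical helper names (`merge_pair`, `apply_merges`, `continue_bpe`).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def apply_merges(word, merges):
    """Segment a word by applying the merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def continue_bpe(merges, corpus_words, num_new_merges):
    """Learn additional merges on new data, starting from the existing segmentation."""
    word_freqs = Counter(corpus_words)
    # Start from what the existing tokenizer already produces on the new data.
    segmented = {w: apply_merges(w, merges) for w in word_freqs}
    known = set(merges)
    new_merges = []
    for _ in range(num_new_merges):
        pair_counts = Counter()
        for word, symbols in segmented.items():
            for pair in zip(symbols, symbols[1:]):
                if pair not in known:
                    pair_counts[pair] += word_freqs[word]
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        new_merges.append(best)
        known.add(best)
        segmented = {w: merge_pair(s, best) for w, s in segmented.items()}
    return list(merges) + new_merges


# Toy example: the base tokenizer knows "low"; the new data favours "lower".
base_merges = [("l", "o"), ("lo", "w")]
new_corpus = ["lower"] * 5 + ["lowest"] * 3 + ["low"] * 2
extended = continue_bpe(base_merges, new_corpus, 2)
print(extended)
```

In this sketch each new merge is built from symbols the existing tokenizer already emits on the new data, which is the property that separates it from appending tokens taken from an independently trained tokenizer.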
Taxonomy
Topics
Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
