Retrofitting Large Language Models with Dynamic Tokenization
Darius Feher, Ivan Vulić, Benjamin Minixhofer

TL;DR
This paper introduces dynamic tokenization for large language models, allowing on-the-fly token boundary decisions to improve efficiency and fairness across languages with minimal performance loss.
Contribution
It proposes a novel dynamic tokenization method that adapts token boundaries during inference, reducing sequence length and improving multilingual fairness in LMs.
Findings
Reduces token sequence length by >20% on average in encoder models (e.g., XLM-R) across 14 languages, with less than 2% performance degradation
Achieves up to 17% sequence length reduction in decoder models with minimal performance loss
Enhances inference speed and language fairness in large language models
Abstract
Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the static design and propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text via a subword-merging algorithm inspired by byte-pair encoding. We merge frequent subword sequences in a batch, then apply a pre-trained embedding-prediction hypernetwork to compute the token embeddings on-the-fly. For encoder-style models (e.g., XLM-R), this on average reduces token sequence lengths by >20% across 14 languages while degrading performance by less than 2%. The same method applied to pre-filling and scoring in decoder-style models (e.g., Mistral-7B) results in minimal performance degradation at…
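The batch-level merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain BPE-style greedy merge over a batch of already-tokenized sequences, and the names `merge_pair` and `dynamic_merge`, the `num_merges` budget, and the rule of stopping when no pair occurs at least twice are all assumptions made for illustration. The embedding-prediction hypernetwork is not modeled here; the sketch only shows how merging frequent adjacent subwords shortens sequences.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def dynamic_merge(batch, num_merges):
    """Greedily merge the most frequent adjacent subword pair in the batch.

    Hypothetical helper: repeats up to `num_merges` times and stops early
    once no adjacent pair occurs at least twice in the batch.
    """
    batch = [list(seq) for seq in batch]  # copy so the input is left untouched
    for _ in range(num_merges):
        counts = Counter()
        for seq in batch:
            counts.update(zip(seq, seq[1:]))  # count adjacent pairs
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        batch = [merge_pair(seq, pair) for seq in batch]
    return batch


# Toy batch of subword sequences (made-up data, one word per sequence).
batch = [["un", "believ", "able"], ["un", "believ", "ably"], ["believ", "able"]]
merged = dynamic_merge(batch, num_merges=2)

before = sum(len(seq) for seq in batch)
after = sum(len(seq) for seq in merged)
print(merged)
print(f"sequence length reduced by {1 - after / before:.0%}")
```

Merged tokens such as the ones produced here will generally not exist in the model's original vocabulary, which is why the abstract pairs the merging step with a hypernetwork that predicts their embeddings on-the-fly.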
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms
Methods: HyperNetwork · XLM-R
