Model-Aware Tokenizer Transfer
Mykola Haltiuk, Aleksander Smywinski-Pohl

TL;DR
The paper introduces MATT, a method that uses model internals and attention patterns to improve tokenizer transfer for multilingual large language models, especially in low-resource languages.
Contribution
MATT incorporates model internals into tokenizer transfer through an Attention Influence Modeling (AIM) objective, which distills attention behavior from the source model, and it outperforms heuristic baselines.
Findings
MATT quickly recovers a large fraction of the original model's performance.
Incorporating attention behavior improves tokenizer transfer quality.
MATT outperforms heuristic baselines in diverse linguistic settings.
Abstract
Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation.…
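The idea of distilling inter-token communication across two tokenizers can be illustrated with a small sketch. This is not the paper's AIM objective, whose exact form is not given in the text above. It is a toy example built on assumptions: the source and target tokenizations cover the same text, tokens are aligned by character-span overlap, attention is a single non-causal head, and the loss is a plain mean squared error. The function names (`overlap_fractions`, `pool_source_attention`, `attention_distillation_loss`) and the example spans are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def overlap_fractions(a_spans, b_spans):
    """m[i, j] = fraction of span a_i's characters that fall inside span b_j.

    Spans are (start, end) character offsets with end exclusive. If the b
    spans tile the text, every row of m sums to 1.
    """
    m = np.zeros((len(a_spans), len(b_spans)))
    for i, (a0, a1) in enumerate(a_spans):
        for j, (b0, b1) in enumerate(b_spans):
            m[i, j] = max(0, min(a1, b1) - max(a0, b0)) / (a1 - a0)
    return m


def pool_source_attention(src_attn, src_spans, tgt_spans):
    """Re-express a source attention map at target-token granularity.

    ASSUMPTION (illustrative only): queries are averaged over the source
    tokens covering each target token, and the attention mass on each source
    key is split across target tokens in proportion to character overlap.
    """
    q = overlap_fractions(tgt_spans, src_spans)  # (T, S), rows sum to 1
    k = overlap_fractions(src_spans, tgt_spans)  # (S, T), rows sum to 1
    return q @ src_attn @ k                      # (T, T), rows sum to 1


def attention_distillation_loss(src_attn, tgt_attn, src_spans, tgt_spans):
    """Mean squared error between pooled teacher and student attention."""
    teacher = pool_source_attention(src_attn, src_spans, tgt_spans)
    return float(np.mean((teacher - tgt_attn) ** 2))


# Toy text "unhappy" (7 characters) under two hypothetical tokenizations.
src_spans = [(0, 2), (2, 5), (5, 7)]  # "un" | "hap" | "py"
tgt_spans = [(0, 3), (3, 7)]          # "unh" | "appy"

rng = np.random.default_rng(0)
src_attn = softmax(rng.normal(size=(3, 3)))  # stand-in for source-model attention
tgt_attn = softmax(rng.normal(size=(2, 2)))  # stand-in for target-model attention

loss = attention_distillation_loss(src_attn, tgt_attn, src_spans, tgt_spans)
print(f"toy attention distillation loss: {loss:.4f}")
```

In a real setup the two attention maps would come from the source and target models on the same text, and the loss would be minimized with respect to the target model's new embeddings before standard language-model training resumes. The alignment scheme and the loss choice here are placeholders for whatever the method actually uses.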
