Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

Charles Goddard; Fernando Fernandes Neto

arXiv:2506.06607 · cs.CL · June 10, 2025

TL;DR

This paper introduces a training-free method using Orthogonal Matching Pursuit to transplant tokenizers in pretrained large language models, enabling effective cross-tokenizer adaptation without retraining.

Contribution

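The paper proposes a training-free approach to tokenizer transplantation in LLMs: each token missing from the base vocabulary is reconstructed as a sparse linear combination of shared-token embeddings, which preserves base-model performance when switching tokenization schemes.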

Findings

01  OMP outperforms existing zero-shot methods in preserving model performance.

02  The method effectively bridges large tokenizer discrepancies without gradient updates.

03  OMP enables practical applications like cross-tokenizer knowledge transfer and vocabulary adaptation.

Abstract

We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token's representation in the donor embedding space with a small dictionary of shared anchor tokens, then transfer these same sparse coefficients back into the base model's embedding space. On two challenging cross-tokenizer tasks--Llama $\to$ Mistral NeMo (12B) and Qwen $\to$ Llama (1B)--we show that OMP achieves best zero-shot preservation of the base model's performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches like WECHSEL, FOCUS, ZETT), OMP…
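
The two-phase procedure described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation. The function names `omp_coefficients` and `transplant_embedding`, the sparsity budget `k`, and the assumption that the shared-token matrices are row-aligned (row i of `donor_shared` and row i of `base_shared` belong to the same token) are all hypothetical choices made for the example.

```python
import numpy as np


def omp_coefficients(dictionary, target, k):
    """Greedy Orthogonal Matching Pursuit over the rows of `dictionary`.

    dictionary: (n_anchors, d) donor-space embeddings of shared tokens.
    target:     (d,) donor-space embedding of the token to reconstruct.
    k:          maximum number of anchors to select (assumed budget).
    Returns (indices, coeffs) with target ~= coeffs @ dictionary[indices].
    """
    D = np.asarray(dictionary, dtype=np.float64)
    y = np.asarray(target, dtype=np.float64)
    norms = np.linalg.norm(D, axis=1)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty rows
    residual = y.copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(min(k, D.shape[0])):
        # Pick the anchor most correlated with the current residual.
        scores = np.abs(D @ residual) / norms
        if selected:
            scores[selected] = -np.inf  # never pick the same anchor twice
        selected.append(int(np.argmax(scores)))
        # Re-fit all selected coefficients jointly (the "orthogonal" step).
        A = D[selected].T  # shape (d, n_selected)
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coeffs
        if np.linalg.norm(residual) < 1e-10:
            break
    return np.array(selected), coeffs


def transplant_embedding(donor_shared, base_shared, donor_new, k=8):
    """Phase 1: sparse-code the new token in donor space.
    Phase 2: reuse the same coefficients on the base model's anchors."""
    idx, coeffs = omp_coefficients(donor_shared, donor_new, k)
    return coeffs @ np.asarray(base_shared, dtype=np.float64)[idx]
```

In this sketch the donor and base embedding dimensions may differ, since only the anchor indices and coefficients cross between the two spaces. Applying `transplant_embedding` once per token missing from the base vocabulary would yield an initial embedding matrix for the new tokenizer without any gradient updates.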

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks