Adapting Language Models via Token Translation
Zhili Feng, Tanya Marwah, Nicolo Fusi, David Alvarez-Melis, Lester Mackey

TL;DR
This paper introduces S2T2, a method to adapt large language models to new domains by training domain-specific tokenizers and translating tokens, improving out-of-domain performance and enabling transfer to larger models.
Contribution
S2T2 is a novel approach that trains domain-specific tokenizers and token translation models, enhancing out-of-domain language modeling and transferability across model sizes.
Findings
S2T2 improves perplexity and compression on out-of-domain protein sequences.
Token translations learned on smaller models transfer effectively to larger models.
S2T2 outperforms direct finetuning with source or target tokenizers.
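The compression and perplexity findings above can be made concrete with a small sketch. This summary does not state which metrics the paper reports, so the bytes-per-token and bits-per-byte measures below are assumptions chosen for illustration. The function names, the example protein string, and both tokenizations are hypothetical.

```python
import math

def bytes_per_token(text: str, tokens: list[str]) -> float:
    """Average number of UTF-8 bytes covered by each token (higher = better compression)."""
    return len(text.encode("utf-8")) / len(tokens)

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Normalizing by bytes rather than tokens keeps models with
    different tokenizers comparable.
    """
    return total_nll_nats / (math.log(2) * len(text.encode("utf-8")))

# Hypothetical protein fragment and two hypothetical tokenizations of it.
seq = "MKTAYIAKQR"
char_tokens = list(seq)                      # one token per residue
merged_tokens = ["MKT", "AY", "IAK", "QR"]   # a tokenizer fitted to the domain

print(bytes_per_token(seq, char_tokens))     # 1.0
print(bytes_per_token(seq, merged_tokens))   # 2.5
```

A tokenizer that covers more bytes per token needs fewer forward passes per sequence, which is the sense in which better compression lowers inference cost.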
Abstract
Modern large language models use a fixed tokenizer to effectively compress text drawn from a source domain. However, applying the same tokenizer to a new target domain often leads to inferior compression, more costly inference, and reduced semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre-trained next-source-token predictor. In our experiments with finetuned English language models, S2T2 improves both the perplexity and the compression of out-of-domain protein sequences, outperforming direct finetuning with either the source or target tokenizer. In addition, we find that token translations learned for smaller, less expensive models can be directly transferred to larger, more powerful…
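As a rough illustration of the token-translation idea, the sketch below builds a coupling matrix between a small target vocabulary and a small source vocabulary, then uses it to express each target-token embedding as a weighted mix of source-token embeddings. This is not the authors' method. It uses ordinary dense entropic Sinkhorn scaling in place of the sparse variant the abstract names, and the scores, marginals, embeddings, and function names are all hypothetical.

```python
import math
import random

def sinkhorn(scores, row_marginal, col_marginal, n_iters=200):
    """Dense entropic Sinkhorn (an assumption; the paper uses a sparse variant).

    Rescales exp(scores) so that row sums match row_marginal and
    column sums match col_marginal.
    """
    K = [[math.exp(s) for s in row] for row in scores]
    n, m = len(K), len(K[0])
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [row_marginal[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marginal[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def translate_embeddings(P, source_emb):
    """Map each target token to a convex combination of source embeddings."""
    dim = len(source_emb[0])
    out = []
    for row in P:
        z = sum(row)  # row-normalize so the weights sum to 1
        out.append([sum(row[j] / z * source_emb[j][k] for j in range(len(row)))
                    for k in range(dim)])
    return out

random.seed(0)
n_target, n_source, dim = 3, 4, 2           # hypothetical toy sizes
scores = [[random.uniform(-1, 1) for _ in range(n_source)] for _ in range(n_target)]
row_marginal = [1.0 / n_target] * n_target  # assumed uniform target-token frequencies
col_marginal = [1.0 / n_source] * n_source  # assumed uniform source-token frequencies

P = sinkhorn(scores, row_marginal, col_marginal)
source_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_source)]
target_emb = translate_embeddings(P, source_emb)
```

In a full system the scores would be learned parameters and the translated embeddings would be fed to the pre-trained model. Here they are random, so the sketch only shows the mechanics of the coupling.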
Taxonomy
Topics: Natural Language Processing Techniques
