Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies
Zhengxuan Wu, Alex Tamkin, Isabel Papadimitriou

TL;DR
This paper systematically investigates what makes cross-lingual transfer learning hard by varying one factor at a time, such as syntax or vocabulary, and finds that vocabulary misalignment and embedding re-initialization significantly hinder performance.
Contribution
It introduces controlled transfer studies to isolate the impact of different language variation axes on transfer learning performance, providing new insights into transfer difficulties.
Findings
Models largely recover from syntactic shifts but not from vocabulary misalignment.
Vocabulary misalignment and embedding re-initialization cause performance drops that persist even after continued pretraining on 15 million tokens.
High-quality tokenizers do not ease vocabulary alignment issues.
Abstract
When we transfer a pretrained language model to a new language, there are many axes of variation that change at once. To disentangle the impact of different factors like syntactic similarity and vocabulary similarity, we propose a set of controlled transfer studies: we systematically transform the language of the GLUE benchmark, altering one axis of crosslingual variation at a time, and then measure the resulting drops in a pretrained model's downstream performance. We find that models can largely recover from syntactic-style shifts, but cannot recover from vocabulary misalignment and embedding matrix re-initialization, even with continued pretraining on 15 million tokens. Moreover, good-quality tokenizers in the transfer language do not make vocabulary…
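To make "altering one axis at a time" concrete, here is a minimal Python sketch of three transformations of the kind the abstract names: a syntactic-style shift, a vocabulary misalignment, and an embedding matrix re-initialization. The function names, the choice of reversed word order, the random token-id permutation, and the Gaussian initialization with standard deviation 0.02 are all assumptions made for illustration; they are not taken from the authors' code.

```python
import random


def reverse_word_order(tokens):
    """Syntactic-style shift (hypothetical example): same tokens, reversed order."""
    return list(reversed(tokens))


def make_vocab_permutation(vocab_size, seed=0):
    """Vocabulary misalignment (hypothetical example): a random bijection on token ids."""
    rng = random.Random(seed)
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return perm


def permute_token_ids(token_ids, perm):
    """Map each token id through the permutation; sentence structure is unchanged."""
    return [perm[t] for t in token_ids]


def reinit_embeddings(vocab_size, dim, seed=0, std=0.02):
    """Embedding re-initialization: fresh Gaussian rows (std is an assumed value)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(vocab_size)]


if __name__ == "__main__":
    print(reverse_word_order(["the", "cat", "sat"]))  # ['sat', 'cat', 'the']

    perm = make_vocab_permutation(vocab_size=10, seed=1)
    # Repeated ids stay repeated, but each id now points at a different embedding row.
    print(permute_token_ids([3, 7, 3], perm))

    fresh = reinit_embeddings(vocab_size=10, dim=4)
    print(len(fresh), len(fresh[0]))  # 10 4
```

In a controlled study of this kind, each transformation would be applied on its own to the downstream data, and the transformed task's score compared against the untransformed baseline, so that any drop can be attributed to that single axis.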
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
