Don't Sweep your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers

Danielle Rothermel; Margaret Li; Tim Rocktäschel; Jakob Foerster

arXiv:2107.12460·cs.LG·July 28, 2021

Open Access

TL;DR

This paper critically examines the transferability of pretrained transformers across modalities, emphasizing the importance of proper hyperparameter tuning, especially learning rates, for accurate evaluation of their transfer performance.

Contribution

The paper shows that the earlier claim that frozen pretrained transformers match or outperform training from scratch is an artifact of untuned learning rates. Once learning rates are tuned, pretrained transformers match or outperform training from scratch only when the entire model is fine-tuned.

Findings

01

Proper learning rate tuning is crucial for transfer performance.

02

Pretrained transformers match or outperform training from scratch on all tasks studied, but only when the entire model is fine-tuned.

03

Conclusions about cross-modal transfer are not robust unless hyperparameters, learning rates in particular, are tuned for each setting being compared.

Abstract

Self-supervised pre-training of large-scale transformer models on text corpora followed by finetuning has achieved state-of-the-art on a number of natural language processing tasks. Recently, Lu et al. (2021, arXiv:2103.05247) claimed that frozen pretrained transformers (FPTs) match or outperform training from scratch as well as unfrozen (fine-tuned) pretrained transformers in a set of transfer tasks to other modalities. In our work, we find that this result is, in fact, an artifact of not tuning the learning rates. After carefully redesigning the empirical setup, we find that when tuning learning rates properly, pretrained transformers do outperform or match training from scratch in all of our tasks, but only as long as the entire model is finetuned. Thus, while transfer from pretrained language models to other modalities does indeed provide gains and hints at exciting possibilities…
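
The methodological point of the abstract, that comparing settings at one shared learning rate can reverse the ranking obtained when each setting is tuned separately, can be illustrated with a toy sketch. This is not the paper's experimental setup: the quadratic objective, the two setting names, their curvature and floor values, and the learning rate grid are all hypothetical and chosen only so that the effect is visible.

```python
# Toy illustration only: a 1-D quadratic stands in for "training a model".
# All numbers below are invented for the example, not taken from the paper.

def final_loss(lr, curvature, floor, steps=50):
    """Run gradient descent on floor + 0.5 * curvature * w**2 from w = 1."""
    w = 1.0
    for _ in range(steps):
        w -= lr * curvature * w  # gradient of 0.5 * curvature * w**2
    return floor + 0.5 * curvature * w * w

# Hypothetical settings: "finetuned" can reach a lower loss (lower floor)
# but is only stable at small learning rates (higher curvature).
SETTINGS = {
    "frozen": {"curvature": 1.0, "floor": 0.3},
    "finetuned": {"curvature": 20.0, "floor": 0.1},
}
LR_GRID = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]

def compare_at_shared_lr(lr):
    """Evaluate every setting at the same learning rate."""
    return {name: final_loss(lr, **cfg) for name, cfg in SETTINGS.items()}

def compare_with_sweep(grid):
    """Pick the best learning rate per setting, then report (lr, loss)."""
    results = {}
    for name, cfg in SETTINGS.items():
        best_lr = min(grid, key=lambda lr: final_loss(lr, **cfg))
        results[name] = (best_lr, final_loss(best_lr, **cfg))
    return results

shared = compare_at_shared_lr(0.1)
tuned = compare_with_sweep(LR_GRID)
winner_shared = min(shared, key=shared.get)
winner_tuned = min(tuned, key=lambda name: tuned[name][1])
print("winner at shared lr:", winner_shared)
print("winner after per-setting sweep:", winner_tuned)
```

In this toy, the shared learning rate happens to suit one setting and not the other, so the comparison mostly measures how well the learning rate fits each setting. Sweeping the grid separately for each setting removes that confound before the settings are ranked.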

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning