An Empirical Study of Scaling Laws for Transfer
Matthew Barnett

TL;DR
This paper empirically investigates how transfer learning effectiveness in transformer models depends on the transfer gap, revealing how data scarcity and distribution differences influence transfer performance and cost-effectiveness.
Contribution
It examines a scaling law that incorporates a transfer gap term and fits it to experiments on diverse datasets, giving insight into transfer learning efficiency and data allocation strategies.
Findings
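The fitted transfer gap varies significantly across datasets.
A low transfer gap makes pre-training the cost-effective choice; a high gap makes collecting high-quality fine-tuning data relatively more cost-effective.
In theory, the scaling law can inform how to allocate data between pre-training and fine-tuning.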
Abstract
We present a limited empirical study of scaling laws for transfer learning in transformer models. More specifically, we examine a scaling law that incorporates a "transfer gap" term, indicating the effectiveness of pre-training on one distribution when optimizing for downstream performance on another distribution. When the transfer gap is low, pre-training is a cost-effective strategy for improving downstream performance. Conversely, when the gap is high, collecting high-quality fine-tuning data becomes relatively more cost-effective. Fitting the scaling law to experiments from diverse datasets reveals significant variations in the transfer gap across distributions. In theory, the scaling law can inform optimal data allocation strategies and highlights how the scarcity of downstream data can bottleneck performance. Our findings contribute to a principled way to measure transfer learning…
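The abstract does not state the functional form of the scaling law, so the following is only a rough sketch of how a transfer gap term could shift the trade-off between pre-training data and fine-tuning data. The form `(A * d_pre^-alpha + G) * d_ft^-beta + E`, the parameter names, and every constant are assumptions made for illustration; they are not values fitted or reported in the paper.

```python
def transfer_loss(d_pre, d_ft, A=1.0, alpha=0.3, G=0.1, beta=0.2, E=0.5):
    """Hypothetical downstream loss with a transfer gap term G.

    Assumed form (not taken from the abstract):
        loss = (A * d_pre^-alpha + G) * d_ft^-beta + E
    d_pre: amount of pre-training data, d_ft: amount of fine-tuning data.
    """
    return (A * d_pre ** -alpha + G) * d_ft ** -beta + E


def gain_ratio(d_pre, d_ft, G, factor=10.0):
    """Loss reduction from `factor`x more fine-tuning data, divided by the
    loss reduction from `factor`x more pre-training data."""
    base = transfer_loss(d_pre, d_ft, G=G)
    pre_gain = base - transfer_loss(d_pre * factor, d_ft, G=G)
    ft_gain = base - transfer_loss(d_pre, d_ft * factor, G=G)
    return ft_gain / pre_gain


if __name__ == "__main__":
    # Compare a small and a large illustrative transfer gap.
    for G in (0.01, 1.0):
        ratio = gain_ratio(1e6, 1e3, G)
        print(f"G={G}: fine-tuning gain / pre-training gain = {ratio:.2f}")
```

Under this assumed form, the ratio grows with `G`, which is consistent with the abstract's qualitative claim that a high transfer gap makes fine-tuning data relatively more valuable. With unlimited pre-training data the loss still only falls to `G * d_ft^-beta + E`, so scarce downstream data remains the bottleneck.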
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks · Imbalanced Data Classification Techniques
