Transfer Learning in Infinite Width Feature Learning Networks
Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan

TL;DR
This paper develops a theoretical framework for transfer learning in infinitely wide neural networks, analyzing how pretraining influences generalization in various feature learning regimes and validating findings on multiple datasets.
Contribution
It introduces a comprehensive theory of transfer learning in infinite-width networks, encompassing fine-tuning and joint feature learning, with interpretable performance insights.
Findings
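Whether pretraining improves generalization on the target task depends on how well the source and target data and tasks align.
After rich pretraining, the network's summary statistics are adaptive kernels that depend on the source data and labels; after learning the target task, they combine features from both source and target tasks.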
Theoretical predictions are validated on linear, polynomial, and real datasets.
Abstract
We develop a theory of transfer learning in infinitely wide neural networks under gradient flow that quantifies when pretraining on a source task improves generalization on a target task. We analyze both (i) fine-tuning, where the downstream predictor is trained on top of source-induced features, and (ii) a jointly rich setting, where both pretraining and downstream tasks can operate in a feature learning regime, but the downstream model is initialized with the features obtained after pretraining. In this setup, the summary statistics of randomly initialized networks after a rich pretraining are adaptive kernels which depend on both source data and labels. For (i), we analyze the performance of a readout for different pretraining data regimes. For (ii), the summary statistics after learning the target task are still adaptive kernels with features from both source and target tasks. We…
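As a rough illustration of setting (i), the sketch below pretrains a small finite-width two-layer network on a source task, freezes its hidden features, and fits only a readout on the target task. It is not the paper's infinite-width theory. The function names (`pretrain_features`, `fit_readout`, `predict`), the tanh activation, the ridge-regression readout, the step size, and the synthetic linear tasks are all assumptions chosen for brevity.

```python
import numpy as np

def pretrain_features(X_src, y_src, width=64, steps=300, seed=0):
    """Full-batch gradient descent on a two-layer tanh network (source task).

    Returns the first-layer weights W, i.e. the "source-induced features".
    Finite-width stand-in only; sizes and step size are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = X_src.shape
    W = rng.normal(size=(d, width)) / np.sqrt(d)      # first-layer weights
    a = rng.normal(size=width) / np.sqrt(width)       # source readout
    lr = 0.5 / width                                  # small step for stability
    for _ in range(steps):
        H = np.tanh(X_src @ W)                        # hidden features, shape (n, width)
        err = H @ a - y_src                           # residuals, shape (n,)
        grad_a = H.T @ err / n
        grad_W = X_src.T @ (err[:, None] * a[None, :] * (1.0 - H**2)) / n
        a -= lr * grad_a
        W -= lr * grad_W
    return W

def fit_readout(W, X_tgt, y_tgt, ridge=1e-3):
    """Fit only a linear readout on frozen features via ridge regression."""
    H = np.tanh(X_tgt @ W)
    A = H.T @ H + ridge * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ y_tgt)

def predict(W, a, X):
    """Evaluate the frozen-feature model with readout a."""
    return np.tanh(X @ W) @ a

# Synthetic example: the target task reuses the source teacher direction (assumption).
rng = np.random.default_rng(1)
d = 10
w_teacher = rng.normal(size=d) / np.sqrt(d)
X_src = rng.normal(size=(200, d)); y_src = X_src @ w_teacher
X_tgt = rng.normal(size=(20, d));  y_tgt = X_tgt @ w_teacher

W = pretrain_features(X_src, y_src)
a_tgt = fit_readout(W, X_tgt, y_tgt)
train_mse = np.mean((predict(W, a_tgt, X_tgt) - y_tgt) ** 2)
```

Changing the target teacher to a direction unrelated to `w_teacher` is one simple way to probe how source and target alignment affects the frozen-feature readout in this toy setting.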
Peer Reviews
Decision: ICLR 2026 Poster
Transfer learning is an important topic of practical relevance and theoretical progress is needed. The manuscript condenses this complex question down to nicely analytically or semi-analytically tractable settings.
The conciseness of the main part makes it hard to follow the main text. The appendix, however, supplies all details as far as I see (apart from points marked above). Some connections between the idealized settings and the real-world settings are still a bit loose and should be mentioned honestly in the text (see points under "soundness" above).
This is a strong paper which uses powerful technical machinery to extract insights about learning trajectories in a transfer learning setup. It's really nice to have a predictive toy model of this phenomenon -- this is the first result (afaik) that establishes average-case predictions for this feature-learning behavior. The DMFT formalism is quite opaque (to me) but the takeaway messages are clearly communicated.
1) The authors don't treat the case where fine-tuning lazily updates all weights in the network (rather than just the readout). Do we expect quantitatively similar behavior in this regime? 2) The analytically tractable results rely on an isotropic data assumption which does not hold in many interesting settings. Data anisotropy can qualitatively change the behavior of learning algorithms, especially if $\Sigma_{xx}$ and $\Sigma_{xy}$ do not commute. I think it'd be really interesting to understa
This paper considers transfer learning, which is both an important problem in the field and difficult to analyze analytically. The approach of simplifying a complex system to a tractable toy model to build intuition is a valuable and necessary process. As deep learning models become increasingly complex, developing a single theoretical framework that explains their behavior in its entirety becomes improbable. For this reason, this type of work, which provides rigorous insights into a simplified
The primary weakness of the paper lies in its presentation, which, especially in Section 2, is exceptionally dense. This density hinders the paper's accessibility and, more importantly, obscures the key findings and novel contributions of the work. The following are concrete examples where the writing and structure could be improved: 1. The authors assume the audience is familiar with DMFT (to the point that the acronym is not defined in the main text), which may not be the case for many readers
Taxonomy
Topics: Machine Learning and ELM · Domain Adaptation and Few-Shot Learning · Face and Expression Recognition
