Transfer Learning in Infinite Width Feature Learning Networks

Clarissa Lauditi; Blake Bordelon; Cengiz Pehlevan

arXiv:2507.04448·cs.LG·February 25, 2026

Open Access · 3 Reviews

TL;DR

This paper develops a theory of transfer learning in infinitely wide neural networks trained with gradient flow. It quantifies when pretraining on a source task improves generalization on a target task, in both a fine-tuning setting and a jointly rich feature learning setting, and validates the predictions on linear, polynomial, and real datasets.

Contribution

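The paper introduces a theory of transfer learning in infinite-width networks that covers both fine-tuning, where a downstream predictor is trained on top of source-induced features, and a jointly rich regime, where the downstream model starts from the pretrained features and continues to learn features on the target task.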

Findings

01

Whether pretraining improves generalization on the target task depends on how well the source and target data and tasks align.

02

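After rich pretraining, the network's summary statistics are adaptive kernels that depend on source data and labels; after learning the target task, they carry features from both source and target tasks.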

03

Theoretical predictions are validated on linear, polynomial, and real datasets.

Abstract

We develop a theory of transfer learning in infinitely wide neural networks under gradient flow that quantifies when pretraining on a source task improves generalization on a target task. We analyze both (i) fine-tuning, when the downstream predictor is trained on top of source-induced features and (ii) a jointly rich setting, where both pretraining and downstream tasks can operate in a feature learning regime, but the downstream model is initialized with the features obtained after pre-training. In this setup, the summary statistics of randomly initialized networks after a rich pre-training are adaptive kernels which depend on both source data and labels. For (i), we analyze the performance of a readout for different pretraining data regimes. For (ii), the summary statistics after learning the target task are still adaptive kernels with features from both source and target tasks. We…
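To make the fine-tuning setting (i) concrete, here is a minimal finite-width sketch of a linear readout fitted on top of frozen features. It is an illustration under stated assumptions, not the paper's method: the names `relu_features`, `fit_readout`, and `source_weights` are hypothetical, the readout is fitted by ridge regression rather than gradient flow, the target task is a made-up linear teacher, and `source_weights` is a random placeholder standing in for weights that would come from pretraining on a source task.

```python
import numpy as np


def relu_features(x, weights):
    """Frozen hidden-layer features: relu(x @ W) / sqrt(width)."""
    return np.maximum(x @ weights, 0.0) / np.sqrt(weights.shape[1])


def fit_readout(features, targets, ridge=1e-3):
    """Ridge regression for a linear readout on top of fixed features."""
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)


rng = np.random.default_rng(0)
dim, width, n_train = 20, 512, 40

# ASSUMPTION: random placeholder for the first-layer weights. In the
# fine-tuning setting these would be obtained by pretraining on a source task.
source_weights = rng.standard_normal((dim, width))

# ASSUMPTION: hypothetical target task generated by a linear teacher.
beta = rng.standard_normal(dim) / np.sqrt(dim)
x_train = rng.standard_normal((n_train, dim))
y_train = x_train @ beta

# Only the readout is trained; the features stay fixed.
phi_train = relu_features(x_train, source_weights)
readout = fit_readout(phi_train, y_train)
train_mse = float(np.mean((phi_train @ readout - y_train) ** 2))
print(f"train MSE with frozen features: {train_mse:.2e}")
```

Replacing `source_weights` with weights taken from an actually pretrained network leaves the readout step unchanged, which is what makes this setting convenient to analyze separately from the jointly rich setting (ii).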

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 5

Strengths

Transfer learning is an important topic of practical relevance and theoretical progress is needed. The manuscript condenses this complex question down to nicely analytically or semi-analytically tractable settings.

Weaknesses

The conciseness of the main part makes it hard to follow the main text. The appendix, however, supplies all details as far as I see (apart from points marked above). Some connections between the idealized settings and the real-world settings are still a bit loose and should be mentioned honestly in the text (see points under "soundness" above).

Reviewer 02 · Rating 8 · Confidence 4

Strengths

This is a strong paper which uses powerful technical machinery to extract insights about learning trajectories in a transfer learning setup. It's really nice to have a predictive toy model of this phenomenon -- this is the first result (afaik) that establishes average-case predictions for this feature-learning behavior. The DMFT formalism is quite opaque (to me) but the takeaway messages are clearly communicated.

Weaknesses

1) The authors don't treat the case where fine-tuning lazily updates all weights in the network (rather than just the readout). Do we expect quantitatively similar behavior in this regime? 2) The analytically tractable results rely on an isotropic data assumption which does not hold in many interesting settings. Data anisotropy can qualitatively change the behavior of learning algorithms, especially if $\Sigma_{xx}$ and $\Sigma_{xy}$ do not commute. I think it'd be really interesting to understa

Reviewer 03 · Rating 4 · Confidence 3

Strengths

This paper considers transfer learning, which is both an important problem in the field and difficult to analyze analytically. The approach of simplifying a complex system to a tractable toy model to build intuition is a valuable and necessary process. As deep learning models become increasingly complex, developing a single theoretical framework that explains their behavior in its entirety becomes improbable. For this reason, this type of work, which provides rigorous insights into a simplified

Weaknesses

The primary weakness of the paper lies in its presentation, which, especially in Section 2, is exceptionally dense. This density hinders the paper's accessibility and, more importantly, obscures the key findings and novel contributions of the work. The following are concrete examples where the writing and structure could be improved: 1. The authors assume the audience is familiar with DMFT (to the point that the acronym is not defined in the main text), which may not be the case for many readers

Taxonomy

Topics: Machine Learning and ELM · Domain Adaptation and Few-Shot Learning · Face and Expression Recognition