Feature Learning in Infinite-Width Neural Networks
Greg Yang, Edward J. Hu

TL;DR
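This paper shows that the standard and NTK parametrizations do not admit infinite-width limits that learn features, and proposes simple modifications to the standard parametrization that do. The resulting limits are derived explicitly with Tensor Programs and outperform NTK baselines and finite-width networks on Word2Vec and on few-shot learning on Omniglot via MAML.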
Contribution
The paper proposes simple modifications to the standard parametrization that allow feature learning at infinite width, derives explicit formulas for the resulting limits using Tensor Programs, and computes these limits exactly on Word2Vec and on few-shot learning on Omniglot via MAML.
Findings
Modified parametrizations enable feature learning in the infinite-width limit.
Explicit formulas for these limits are derived using Tensor Programs.
On Word2Vec and few-shot learning on Omniglot via MAML, the feature-learning infinite-width limits outperform both NTK baselines and finite-width networks.
Abstract
As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the *Tensor Programs* technique, we derive explicit formulas for such limits. On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finite-width networks, with the latter…
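The statement that the NTK parametrization does not learn features in the limit can be illustrated numerically. The following is a minimal sketch, not the paper's code and not a Tensor Programs derivation. It assumes a hypothetical two-layer tanh network in NTK parametrization, a single made-up training example, and one gradient step on squared loss with a width-independent learning rate; the function names `relative_feature_change` and `mean_relative_feature_change` are illustrative only. It measures how much the hidden pre-activations move relative to their size.

```python
import numpy as np


def relative_feature_change(width, d=16, lr=1.0, seed=0):
    """Relative change in hidden pre-activations after one gradient step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)           # one training input (made-up data)
    y = 1.0                              # its target (assumed)
    W = rng.standard_normal((width, d))  # hidden weights, N(0, 1) entries
    v = rng.standard_normal(width)       # output weights, N(0, 1) entries

    # NTK parametrization: explicit 1/sqrt(fan_in) multipliers in the forward pass.
    h = W @ x / np.sqrt(d)
    f = v @ np.tanh(h) / np.sqrt(width)

    # Gradient of the loss 0.5 * (f - y)**2 with respect to W.
    dphi = 1.0 - np.tanh(h) ** 2
    grad_W = (f - y) * np.outer(v * dphi, x) / np.sqrt(width * d)

    # Pre-activations after one gradient-descent step on W.
    h_new = (W - lr * grad_W) @ x / np.sqrt(d)
    return np.linalg.norm(h_new - h) / np.linalg.norm(h)


def mean_relative_feature_change(width, n_seeds=10):
    """Average the relative change over several random initializations."""
    changes = [relative_feature_change(width, seed=s) for s in range(n_seeds)]
    return float(np.mean(changes))


for n in (64, 1024, 4096):
    print(n, mean_relative_feature_change(n))
```

Under these assumptions the printed ratio should shrink roughly like 1/sqrt(width): each hidden coordinate moves by an amount of order 1/sqrt(width) while the coordinates themselves stay of order 1, which is the sense in which the features stop changing as the width grows.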
Videos
W&B Deep Learning Salon - Greg Yang (YouTube)
Greg Yang on Feature Learning in Infinite-Width Networks (YouTube)
Taxonomy
Topics: Topic Modeling · Tensor decomposition and applications · Stochastic Gradient Optimization Techniques
Methods: Linear Layer · Neural Tangent Kernel · Layer Normalization · Dense Connections · Residual Connection · Attention Dropout · Weight Decay · Attention Is All You Need · Multi-Head Attention
