Feature Learning in Infinite-Width Neural Networks

Greg Yang; Edward J. Hu

arXiv:2011.14522·cs.LG·July 18, 2022·20 cites

TL;DR

This paper introduces modified neural network parametrizations that enable feature learning in the infinite-width limit. Explicit formulas for these limits are derived using Tensor Programs, and on Word2Vec and few-shot learning on Omniglot via MAML the limits outperform both NTK baselines and finite-width networks.

Contribution

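It proposes simple modifications to the standard parametrization that allow feature learning at infinite width, derives explicit formulas for the resulting limits, and demonstrates improved performance on two canonical feature-learning tasks: Word2Vec and few-shot learning on Omniglot via MAML.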

Findings

01. Modified parametrizations enable feature learning in the infinite-width limit.

02. Explicit formulas for these limits are derived using Tensor Programs.

03. Infinite-width models with feature learning outperform NTK and finite-width networks.

Abstract

As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the *Tensor Programs* technique, we derive explicit formulas for such limits. On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finite-width networks, with the latter…
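The two parametrizations named in the abstract can be contrasted on a single linear layer. The following is a minimal sketch, not the paper's code and not the modification the paper proposes: it assumes the usual textbook forms, where the standard parametrization puts the 1/sqrt(fan_in) scale into the initialization and the NTK parametrization applies it as a fixed multiplier in the forward pass. All function names and layer sizes are hypothetical.

```python
import numpy as np

def init_standard(fan_in, fan_out, rng):
    # Standard parametrization (assumed form): the 1/sqrt(fan_in) scale
    # lives in the initialization.
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

def forward_standard(W, x):
    return W @ x

def init_ntk(fan_in, fan_out, rng):
    # NTK parametrization (assumed form): unit-variance weights at initialization.
    return rng.normal(0.0, 1.0, size=(fan_out, fan_in))

def forward_ntk(w, x):
    # The 1/sqrt(fan_in) scale is a fixed multiplier in the forward pass.
    return (w @ x) / np.sqrt(w.shape[1])

def grad_standard(g, x):
    # Gradient of the loss w.r.t. the stored weights, given g = dLoss/dOutput.
    return np.outer(g, x)

def grad_ntk(g, x):
    # Same chain rule, but the forward multiplier also scales the gradient.
    return np.outer(g, x) / np.sqrt(x.shape[0])

fan_in, fan_out = 1024, 8  # hypothetical sizes
x = np.random.default_rng(0).normal(size=fan_in)

# Same seed for both draws, so the underlying Gaussian samples coincide.
W = init_standard(fan_in, fan_out, np.random.default_rng(1))
w = init_ntk(fan_in, fan_out, np.random.default_rng(1))

# Both layers compute the same function at initialization.
same_at_init = np.allclose(forward_standard(W, x), forward_ntk(w, x))

# Their gradients w.r.t. the stored weights differ by a factor sqrt(fan_in).
g = np.ones(fan_out)
grad_ratio = np.linalg.norm(grad_standard(g, x)) / np.linalg.norm(grad_ntk(g, x))
```

The two layers agree at initialization, but a gradient step on the stored weights moves them by amounts that differ by sqrt(fan_in). This width-dependent scaling is the kind of choice a parametrization fixes, and it determines how training behaves as the width grows.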

Code & Models

Videos

W&B Deep Learning Salon - Greg Yang · YouTube

Greg Yang on Feature Learning in Infinite-Width Networks · YouTube

Taxonomy

Topics: Topic Modeling · Tensor decomposition and applications · Stochastic Gradient Optimization Techniques

Methods: Linear Layer · Neural Tangent Kernel · Layer Normalization · Dense Connections · Residual Connection · Attention Dropout · Weight Decay · Attention Is All You Need · Multi-Head Attention