Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think

Luke Sernau

arXiv:2406.18800·cs.LG·October 25, 2024

Open Access

TL;DR

This paper challenges the belief that the absence of feature learning explains the weak performance of infinite-width models such as NTKs. It argues that part of the gap comes from reliance on weak optimizers like SGD, and that an infinite-width limit built on ADAM-like dynamics closes it.

Contribution

The paper introduces a new infinite-width limit based on ADAM-like learning dynamics and shows empirically that the resulting models close the performance gap with finite models, without relying on feature learning.

Findings

01

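Infinite-width NTKs do not need feature learning: they can reproduce the same behavior by selecting relevant subfeatures from their (infinite) frozen feature vector.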

02

Reliance on weak optimizers like SGD at least partly explains the weak performance of existing infinite-width models.

03

An infinite-width limit based on ADAM-like learning dynamics empirically erases the performance gap with finite models.

Abstract

Common infinite-width architectures such as Neural Tangent Kernels (NTKs) have historically shown weak performance compared to finite models. This is usually attributed to the absence of feature learning. We show that this explanation is insufficient. Specifically, we show that infinite width NTKs obviate the need for feature learning. They can learn identical behavior by selecting relevant subfeatures from their (infinite) frozen feature vector. Furthermore, we show experimentally that NTKs under-perform traditional finite models even when feature learning is artificially disabled. Instead, we show that weak performance is at least partly due to the fact that existing constructions depend on weak optimizers like SGD. We provide a new infinite width limit based on ADAM-like learning dynamics and demonstrate empirically that the resulting models erase this performance gap.
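
As a rough illustration of the setting the abstract describes, the sketch below trains a model whose features are frozen, once with plain gradient descent and once with an Adam-style update. This is not the paper's infinite-width construction: a finite set of random cosine features stands in for the frozen feature vector, the target function is arbitrary, and names such as `make_features`, `train_gd`, and `train_adam` are hypothetical. Only the linear readout `theta` is trained, so any fitting comes from weighting existing features rather than changing them.

```python
import math
import random

def make_features(x, w, b):
    """Frozen random features phi_j(x) = cos(w_j * x + b_j) / sqrt(m).
    Assumption: a finite stand-in for an infinite frozen feature vector."""
    m = len(w)
    return [math.cos(w[j] * x + b[j]) / math.sqrt(m) for j in range(m)]

def predict(theta, phi):
    return sum(t * p for t, p in zip(theta, phi))

def mse(theta, feats, ys):
    return sum((predict(theta, p) - y) ** 2 for p, y in zip(feats, ys)) / len(ys)

def grad(theta, feats, ys):
    """Gradient of the mean squared error with respect to the readout weights."""
    g = [0.0] * len(theta)
    n = len(ys)
    for p, y in zip(feats, ys):
        err = predict(theta, p) - y
        for j, pj in enumerate(p):
            g[j] += 2.0 * err * pj / n
    return g

def train_gd(feats, ys, steps, lr):
    """Plain full-batch gradient descent on the readout weights."""
    theta = [0.0] * len(feats[0])
    for _ in range(steps):
        g = grad(theta, feats, ys)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta

def train_adam(feats, ys, steps, lr, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update (bias-corrected moments) on the same weights."""
    d = len(feats[0])
    theta, m, v = [0.0] * d, [0.0] * d, [0.0] * d
    for t in range(1, steps + 1):
        g = grad(theta, feats, ys)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        theta = [th - lr * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

rng = random.Random(0)
m_feat = 64
w = [rng.gauss(0.0, 3.0) for _ in range(m_feat)]           # frozen, never updated
b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(m_feat)]
xs = [i / 20 for i in range(-20, 21)]
ys = [math.sin(3.0 * x) for x in xs]                        # arbitrary toy target
feats = [make_features(x, w, b) for x in xs]

loss_init = mse([0.0] * m_feat, feats, ys)
loss_gd = mse(train_gd(feats, ys, steps=200, lr=0.1), feats, ys)
loss_adam = mse(train_adam(feats, ys, steps=200, lr=0.01), feats, ys)
print(loss_init, loss_gd, loss_adam)
```

The two training loops differ only in the update rule applied to `theta`; the features `w` and `b` are identical and untouched in both. The step counts and learning rates are illustrative choices, and the toy problem is not meant to reproduce the paper's experiments or its reported gap between optimizers.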

Taxonomy

Topics: Machine Learning and Data Classification

Methods: Neural Tangent Kernel · Stochastic Gradient Descent