Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit

Greg Yang; Etai Littwin

arXiv:2308.01814·cs.LG·August 8, 2023

TL;DR

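This paper extends the Tensor Programs framework to wide neural networks trained by adaptive optimizers such as Adam. It shows that the dichotomy between feature learning and kernel behavior known from SGD still holds, with a nonlinear notion of "kernel", and derives the corresponding neural tangent and maximal update limits for any architecture.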

Contribution

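It introduces NEXORT, a new Tensor Program language that can express how adaptive optimizers process gradients into updates, and bra-ket notation that simplifies expressions and calculations in Tensor Programs. Together these enable the analysis of adaptive optimizers in the infinite-width limit.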

Findings

01

The dichotomy between feature learning and kernel behavior seen with SGD also holds for general optimizers, including Adam, but with a nonlinear notion of "kernel."

02

The paper derives neural tangent and maximal update limits for any architecture.

03

It generalizes previous Tensor Programs results to include adaptive optimizers.

Abstract

Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam? Here we show: The same dichotomy between feature learning and kernel behaviors (as in SGD) holds for general optimizers as well, including Adam -- albeit with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. Two foundational advances underlie the above results: 1) A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2) The introduction of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.
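To make concrete what it means for an adaptive optimizer to "process gradients into updates", here is a minimal sketch of the standard Adam update rule in NumPy. It is not the paper's NEXORT formalism; the function name `adam_updates` and the toy gradients are illustrative assumptions. The sketch shows two properties of Adam: each coordinate's update depends only on that coordinate's own gradient history, and the update is a nonlinear function of the gradients (rescaling every gradient by a constant leaves the update almost unchanged, whereas an SGD update would scale with it).

```python
import numpy as np

def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam parameter update for each step of a gradient sequence.

    `grads` is a list of equally shaped arrays, one per training step.
    This is the textbook Adam rule, written out for illustration only.
    """
    m = np.zeros_like(grads[0], dtype=float)  # running mean of gradients
    v = np.zeros_like(grads[0], dtype=float)  # running mean of squared gradients
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))
    return updates

# Toy gradient history: 3 steps, 4 coordinates (hypothetical values).
rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(3)]

# Updates computed on the full vector.
full = adam_updates(grads)

# Updates computed one coordinate at a time; these match `full` because
# every operation in Adam acts entrywise.
per_coord = [adam_updates([g[i:i + 1] for g in grads]) for i in range(4)]

# Updates for gradients rescaled by 10; nearly identical to `full`
# (up to the small `eps` term), so the map from gradients to updates
# is not linear.
scaled = adam_updates([10.0 * g for g in grads])
```

Comparing `full` with `per_coord` and `scaled` illustrates both properties: coordinates are processed independently, and the update's magnitude does not grow with the gradient scale.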

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Computational Physics and Python Applications · Stochastic Gradient Optimization Techniques

Methods: Adam