Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit
Greg Yang, Etai Littwin

TL;DR
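This paper extends the Tensor Programs framework to adaptive optimizers such as Adam in wide neural networks. It shows that the dichotomy between feature learning and kernel behavior persists, with a nonlinear notion of "kernel," and derives the corresponding neural tangent and maximal update limits for any architecture.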
Contribution
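It introduces NEXORT, a new Tensor Program language that can express how adaptive optimizers process gradients into updates, and bra-ket notation, which simplifies expressions and calculations in Tensor Programs. Together these enable the analysis of adaptive optimizers in the infinite-width limit.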
Findings
The dichotomy between feature learning and kernel behavior known from SGD also holds for adaptive optimizers such as Adam, but with a nonlinear notion of "kernel."
The paper derives neural tangent and maximal update limits for any architecture.
It generalizes previous Tensor Programs results to include adaptive optimizers.
Abstract
Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam? Here we show: The same dichotomy between feature learning and kernel behaviors (as in SGD) holds for general optimizers as well, including Adam -- albeit with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. Two foundational advances underlie the above results: 1) A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2) The introduction of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.
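One way to see why the "kernel" becomes nonlinear is to compare how SGD and Adam turn gradients into updates. The sketch below is illustrative only and is not taken from the paper: it uses the standard bias-corrected Adam rule, and the function names, learning rate, and toy gradient history are assumptions made for the example. It checks that scaling every gradient by 10 scales the SGD update by 10, while the Adam update is left essentially unchanged, so the Adam update is not a linear function of the gradients.

```python
import math

def sgd_update(grads, lr=0.1):
    """SGD: the update is a linear function of the latest gradient."""
    return [-lr * g for g in grads[-1]]

def adam_update(grads, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard bias-corrected Adam: each coordinate's update is a
    nonlinear function of that coordinate's gradient history."""
    n = len(grads[0])
    m = [0.0] * n  # running average of gradients
    v = [0.0] * n  # running average of squared gradients
    for g in grads:
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    t = len(grads)  # number of steps taken, used for bias correction
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    return [-lr * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]

# Toy gradient history (hypothetical values): two steps, two parameters.
history = [[0.5, -2.0], [1.0, -1.0]]
scaled = [[10 * g for g in step] for step in history]

sgd_base, sgd_scaled = sgd_update(history), sgd_update(scaled)
adam_base, adam_scaled = adam_update(history), adam_update(scaled)

print("SGD :", sgd_base, "->", sgd_scaled)    # scales with the gradients
print("Adam:", adam_base, "->", adam_scaled)  # nearly unchanged
```

Because the Adam update depends on the gradient history through an entrywise nonlinear map rather than a linear one, the function-space update in the kernel regime cannot be written as a fixed kernel applied linearly to the loss derivatives, which is consistent with the abstract's "nonlinear notion of kernel."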
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Computational Physics and Python Applications · Stochastic Gradient Optimization Techniques
Methods: Adam
