Why Transformers Need Adam: A Hessian Perspective

Yushun Zhang; Congliang Chen; Tian Ding; Ziniu Li; Ruoyu Sun; Zhi-Quan Luo

arXiv:2402.16788 · cs.LG · October 22, 2024 · 2 cites

TL;DR

This paper explains why Adam outperforms SGD on Transformers by analyzing the Hessian spectrum. The spectrum varies sharply across parameter blocks (block heterogeneity), which hampers SGD because it uses a single learning rate; Adam copes better thanks to its coordinate-wise learning rates.

Contribution

The paper introduces the concept of block heterogeneity in the Hessian spectrum and demonstrates its impact on optimizer performance, providing a Hessian-based explanation for Adam's superiority on Transformers.
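
To make "block heterogeneity" concrete, here is a minimal sketch that extracts the eigenvalues of each diagonal block of a small symmetric matrix and compares their scales. This page does not say how the paper quantifies heterogeneity, so `scale_spread` (the ratio of the largest to the smallest per-block top eigenvalue) is a hypothetical proxy chosen for illustration, and the toy matrices and block sizes are assumptions.

```python
import numpy as np

def block_spectra(hessian, block_sizes):
    """Return the eigenvalues of each diagonal block of a symmetric matrix."""
    spectra, start = [], 0
    for size in block_sizes:
        block = hessian[start:start + size, start:start + size]
        spectra.append(np.linalg.eigvalsh(block))
        start += size
    return spectra

def scale_spread(spectra):
    """Hypothetical proxy (not the paper's metric): ratio of the largest
    to the smallest per-block top eigenvalue."""
    tops = np.array([s.max() for s in spectra])
    return tops.max() / tops.min()

# Two toy 4x4 "Hessians", each split into two 2x2 parameter blocks.
homogeneous = np.diag([1.0, 2.0, 1.0, 2.0])        # both blocks share a spectrum
heterogeneous = np.diag([1.0, 2.0, 100.0, 200.0])  # block spectra differ by 100x

spread_homo = scale_spread(block_spectra(homogeneous, [2, 2]))
spread_hetero = scale_spread(block_spectra(heterogeneous, [2, 2]))
print(spread_homo, spread_hetero)
```

A spread near 1 means the blocks have similar curvature scales; a large spread is the kind of mismatch that a single learning rate cannot fit.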

Findings

01. Transformers exhibit significant block heterogeneity in the Hessian spectrum.

02. SGD performs worse than Adam in problems with high block heterogeneity.

03. Using coordinate-wise learning rates can mitigate the limitations of SGD.

Abstract

SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation through the lens of the Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum varies dramatically across parameter blocks, a phenomenon we call "block heterogeneity"; (ii) Heterogeneity hampers SGD: SGD performs worse than Adam on problems with block heterogeneity. To validate (i) and (ii), we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when the heterogeneity exists. Our initial theoretical analysis indicates that SGD performs worse because it applies one single learning rate to all blocks, which cannot handle the heterogeneity among blocks. This limitation could be ameliorated if we use coordinate-wise…
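
The single-learning-rate argument can be illustrated on a toy quadratic. The sketch below is not the paper's experiment: it runs plain (deterministic) gradient descent on a two-block quadratic whose block spectra differ by a factor of 100, once with one shared step size and once with a separate step size per block. The spectra, dimensions, step count, and the classical step size 2/(L + mu) are all assumptions chosen for illustration, and per-block step sizes stand in for the coordinate-wise rates mentioned in the abstract.

```python
import numpy as np

def make_block(eigs, rng):
    """Build a dense symmetric block with the prescribed eigenvalues."""
    q, _ = np.linalg.qr(rng.standard_normal((len(eigs), len(eigs))))
    return q @ np.diag(eigs) @ q.T

def run_gd(blocks, lrs, steps=100):
    """Minimize sum_b 0.5 * x_b^T H_b x_b, using learning rate lrs[b] on block b."""
    xs = [np.ones(h.shape[0]) for h in blocks]
    for _ in range(steps):
        # The gradient of block b is H_b x_b because the blocks are decoupled.
        xs = [x - lr * (h @ x) for x, h, lr in zip(xs, blocks, lrs)]
    return sum(0.5 * x @ h @ x for x, h in zip(xs, blocks))

rng = np.random.default_rng(0)
# Assumed toy spectra: block 0 lives in [1, 2], block 1 in [100, 200].
spectra = [np.linspace(1.0, 2.0, 5), np.linspace(100.0, 200.0, 5)]
blocks = [make_block(e, rng) for e in spectra]

# One shared step size, tuned to the global extremes: 2 / (L + mu).
all_eigs = np.concatenate(spectra)
global_lr = 2.0 / (all_eigs.max() + all_eigs.min())
loss_single = run_gd(blocks, [global_lr, global_lr])

# One step size per block, each tuned to that block's own spectrum.
block_lrs = [2.0 / (e.max() + e.min()) for e in spectra]
loss_block = run_gd(blocks, block_lrs)

print(f"single lr: {loss_single:.3e}   per-block lr: {loss_block:.3e}")
```

With a shared step size, the large-curvature block caps the step, so the small-curvature block barely moves; with per-block step sizes, each block sees a condition number of only 2 and converges quickly.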

Code & Models

Videos

Why Transformers Need Adam: A Hessian Perspective· slideslive

Taxonomy

Topics: Space Science and Extraterrestrial Life

Methods: Adam · Stochastic Gradient Descent