Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo

TL;DR
This paper explains why Adam outperforms SGD on Transformers by analyzing the Hessian spectrum: the spectrum differs sharply across parameter blocks (block heterogeneity), which hampers SGD's single learning rate but is handled better by Adam's coordinate-wise learning rates.
Contribution
The paper introduces the concept of block heterogeneity in the Hessian spectrum and demonstrates its impact on optimizer performance, providing a Hessian-based explanation for Adam's superiority on Transformers.
Findings
Transformers exhibit significant block heterogeneity in the Hessian spectrum.
SGD performs worse than Adam on problems with block heterogeneity and on par with Adam on problems without it.
Coordinate-wise learning rates, as used by Adam, could mitigate SGD's limitation of applying a single learning rate to all blocks.
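The findings above can be illustrated on a toy quadratic. The following is a minimal sketch, not the paper's experiment: it assumes a diagonal Hessian with two blocks whose eigenvalues differ by a factor of 100, and it uses one step size per block as a simple stand-in for coordinate-wise learning rates (it is not Adam). The function names and the choice of step size 1/L are illustrative assumptions.

```python
import numpy as np

def run_gd(hessian_diag, step_sizes, num_steps=100):
    """Gradient descent on f(w) = 0.5 * sum(h_i * w_i^2), starting from w = 1.

    hessian_diag: 1-D array of Hessian eigenvalues (diagonal Hessian assumed).
    step_sizes:   scalar or 1-D array of per-coordinate step sizes.
    Returns the final loss value.
    """
    w = np.ones_like(hessian_diag)
    for _ in range(num_steps):
        grad = hessian_diag * w          # gradient of the quadratic
        w = w - step_sizes * grad
    return 0.5 * float(np.sum(hessian_diag * w ** 2))

# Two hypothetical parameter blocks with very different spectra.
block_a = np.array([1.0, 2.0, 3.0])
block_b = np.array([100.0, 200.0, 300.0])
hessian_diag = np.concatenate([block_a, block_b])

# Single learning rate: limited by the largest eigenvalue over all blocks.
single_lr = 1.0 / hessian_diag.max()
loss_single = run_gd(hessian_diag, single_lr)

# Block-wise learning rates: each block uses 1 / (its own largest eigenvalue).
blockwise_lr = np.concatenate([
    np.full(block_a.size, 1.0 / block_a.max()),
    np.full(block_b.size, 1.0 / block_b.max()),
])
loss_blockwise = run_gd(hessian_diag, blockwise_lr)

print(f"single learning rate:     loss = {loss_single:.3e}")
print(f"block-wise learning rate: loss = {loss_blockwise:.3e}")
```

With a single learning rate, the step size that is safe for the large-eigenvalue block is far too small for the other block, so that block barely moves. Giving each block its own step size removes this bottleneck in this toy setting.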
Abstract
SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation through the lens of the Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum varies dramatically across parameter blocks, a phenomenon we call "block heterogeneity"; (ii) Heterogeneity hampers SGD: SGD performs worse than Adam on problems with block heterogeneity. To validate (i) and (ii), we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when the heterogeneity exists. Our initial theoretical analysis indicates that SGD performs worse because it applies one single learning rate to all blocks, which cannot handle the heterogeneity among blocks. This limitation could be ameliorated if we use coordinate-wise…
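To make "the Hessian spectrum varies across parameter blocks" concrete, here is a small sketch that extracts the eigenvalues of each diagonal block of a Hessian and compares them. The spread measure used here (ratio of the largest to the smallest per-block top eigenvalue) is a crude proxy chosen for this illustration and is an assumption, not the paper's metric; the helper names are hypothetical, and the sketch assumes every block has a positive top eigenvalue.

```python
import numpy as np

def block_spectra(hessian, block_sizes):
    """Return the eigenvalues of each diagonal block of a symmetric Hessian."""
    spectra, start = [], 0
    for size in block_sizes:
        block = hessian[start:start + size, start:start + size]
        spectra.append(np.linalg.eigvalsh(block))  # eigenvalues, ascending
        start += size
    return spectra

def top_eigenvalue_spread(spectra):
    """Crude heterogeneity proxy: max over blocks / min over blocks of the
    largest eigenvalue. Assumes each block's top eigenvalue is positive."""
    tops = [float(s.max()) for s in spectra]
    return max(tops) / min(tops)

# Hypothetical Hessians with two blocks of three parameters each.
homogeneous = np.diag([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
heterogeneous = np.diag([1.0, 2.0, 3.0, 100.0, 200.0, 300.0])

for name, hessian in [("homogeneous", homogeneous),
                      ("heterogeneous", heterogeneous)]:
    spread = top_eigenvalue_spread(block_spectra(hessian, [3, 3]))
    print(f"{name}: top-eigenvalue spread across blocks = {spread:.1f}")
```

A spread near 1 means the blocks have similar curvature scales; a large spread indicates that no single learning rate suits all blocks. For real networks the full Hessian is too large to form explicitly, so this dense version only applies to small problems.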
Taxonomy
Topics: Space Science and Extraterrestrial Life
Methods: Adam · Stochastic Gradient Descent
