Toward Understanding Why Adam Converges Faster Than SGD for Transformers
Yan Pan, Yuanzhi Li

TL;DR
This paper investigates why Adam converges faster than SGD for training transformers, introducing the concept of directional sharpness and proposing coordinate-wise clipping to improve convergence speed.
Contribution
It introduces directional sharpness as a key factor in optimization performance and proposes coordinate-wise clipping to enhance SGD convergence.
Findings
Adam exhibits lower directional sharpness than SGD.
Coordinate-wise clipping reduces directional sharpness and accelerates convergence.
Clipping improves local loss reduction when only a small fraction of coordinates is responsible for the bad sharpness.
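A minimal sketch of what coordinate-wise clipping could look like in an SGD step. The function names, the threshold value, and the example gradient are hypothetical and chosen for illustration only; the summary above does not specify how the clipping threshold is selected.

```python
import numpy as np

def clip_coordinatewise(grad, threshold):
    """Clip every gradient coordinate independently to [-threshold, threshold].

    Unlike norm clipping, which rescales the whole vector, this only
    touches the coordinates whose magnitude exceeds the threshold.
    """
    return np.clip(grad, -threshold, threshold)

def sgd_step_with_clipping(params, grad, lr, threshold):
    """One plain SGD step using the coordinate-wise clipped gradient."""
    return params - lr * clip_coordinatewise(grad, threshold)

# Illustrative gradient with a single outlier coordinate (index 2).
grad = np.array([0.1, -0.2, 50.0, 0.05])
clipped = clip_coordinatewise(grad, threshold=1.0)

params = np.zeros(4)
new_params = sgd_step_with_clipping(params, grad, lr=0.1, threshold=1.0)
```

In this toy example only the outlier coordinate is changed; the remaining coordinates pass through untouched, which matches the finding that a few coordinates account for the problem.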
Abstract
While stochastic gradient descent (SGD) is still the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established empirical advantages over SGD in some deep learning applications such as training transformers. However, it remains unclear why Adam converges significantly faster than SGD in these scenarios. In this paper, we propose one explanation of why Adam converges faster than SGD using a new concept, directional sharpness. We argue that the performance of optimization algorithms is closely related to the directional sharpness of the update steps, and show that SGD has much worse directional sharpness compared to adaptive algorithms. We further observe that only a small fraction of the coordinates causes the bad sharpness and slow convergence of SGD, and propose to use coordinate-wise clipping as a solution to SGD and other optimization…
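The abstract does not spell out a formula for directional sharpness. The sketch below assumes it is the curvature of the loss along the normalized update direction, v^T H v, where H is the Hessian. The quadratic toy loss, the Hessian values, and the function name are all hypothetical; this illustrates the idea and does not reproduce the paper's experiments.

```python
import numpy as np

def directional_sharpness(hessian, direction):
    """Curvature along a unit-normalized direction: v^T H v (assumed definition)."""
    v = direction / np.linalg.norm(direction)
    return float(v @ hessian @ v)

# Toy quadratic loss f(x) = 0.5 * x^T H x with one very sharp coordinate.
H = np.diag([100.0, 1.0, 1.0, 1.0])
x = np.ones(4)
grad = H @ x  # gradient of the quadratic, dominated by coordinate 0

# Direction of a plain SGD step versus a coordinate-wise clipped step.
sharp_sgd = directional_sharpness(H, grad)
sharp_clipped = directional_sharpness(H, np.clip(grad, -1.0, 1.0))

print(f"SGD direction:     {sharp_sgd:.2f}")
print(f"clipped direction: {sharp_clipped:.2f}")
```

On this toy problem the raw gradient points almost entirely along the sharp coordinate, so its directional sharpness is close to the largest Hessian eigenvalue, while the clipped direction spreads weight across coordinates and sees much lower curvature.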
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Advanced Neural Network Applications
Methods: Adam · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Stochastic Gradient Descent
