Why are Adaptive Methods Good for Attention Models?

Jingzhao Zhang; Sai Praneeth Karimireddy; Andreas Veit; Seungyeon Kim; Sashank J Reddi; Sanjiv Kumar; Suvrit Sra

arXiv:1912.03194 · math.OC · October 26, 2020 · 39 cites

TL;DR

This paper investigates why adaptive optimization methods outperform SGD in attention models, highlighting the role of heavy-tailed gradient noise and demonstrating how gradient clipping enhances training, with practical improvements shown on BERT tasks.

Contribution

It provides the first tight convergence bounds for adaptive methods under heavy-tailed noise and introduces ACClip, an adaptive clipping algorithm that improves BERT training.

Findings

01

Heavy-tailed noise in stochastic gradients is one cause of SGD's poor performance.

02

Gradient clipping mitigates heavy-tailed noise effects.

03

ACClip outperforms existing methods on BERT tasks.

Abstract

While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise. Subsequently, we show how clipping can be applied in practice by developing an adaptive coordinate-wise clipping algorithm (ACClip) and demonstrate its superior…
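To make the clipping idea in the abstract concrete, here is a minimal NumPy sketch of one clipped SGD step, with both a global-norm variant and a coordinate-wise variant. It is an illustration only, not the paper's ACClip: the function names (`clip_global`, `clip_coordinatewise`, `clipped_sgd_step`) are hypothetical, and the threshold `tau` is assumed to be a fixed constant supplied by the caller, whereas the abstract describes ACClip as choosing its clipping adaptively.

```python
import numpy as np


def clip_global(grad, tau):
    """Rescale grad so that its Euclidean norm is at most tau (assumed fixed)."""
    norm = np.linalg.norm(grad)
    # Small constant guards against division by zero for an all-zero gradient.
    scale = min(1.0, tau / (norm + 1e-12))
    return grad * scale


def clip_coordinatewise(grad, tau):
    """Clamp every coordinate of grad to [-tau, tau].

    tau may be a scalar or an array with one threshold per coordinate.
    """
    return np.clip(grad, -tau, tau)


def clipped_sgd_step(x, grad, lr, tau, coordinatewise=False):
    """One SGD update that uses a clipped stochastic gradient."""
    clip = clip_coordinatewise if coordinatewise else clip_global
    return x - lr * clip(grad, tau)


# Usage: a single large gradient sample, as heavy-tailed noise can produce,
# moves the iterate by at most lr * tau under global clipping.
x = np.zeros(2)
outlier_grad = np.array([300.0, 400.0])
x_next = clipped_sgd_step(x, outlier_grad, lr=0.1, tau=1.0)
```

Global clipping preserves the gradient's direction and bounds its length, while coordinate-wise clipping bounds each entry separately and can therefore change the direction. An adaptive scheme would replace the constant `tau` with a quantity estimated during training.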

Videos

Why are Adaptive Methods Good for Attention Models? · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and ELM

Methods: Linear Layer · Gradient Clipping · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · WordPiece · Softmax