# Making Asynchronous Stochastic Gradient Descent Work for Transformers

**Authors:** Alham Fikri Aji, Kenneth Heafield

arXiv: 1906.03496 · 2021-11-30

## TL;DR

This paper proposes a hybrid asynchronous SGD method that sums several asynchronous updates before applying them, rather than applying each one immediately. This restores the convergence of Transformer models and speeds up training without reducing model quality.

## Contribution

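The paper isolates the batch size and staleness effects that distinguish asynchronous from synchronous SGD, then introduces a hybrid approach that sums several asynchronous updates before applying them. This addresses the poor convergence of Transformers trained with asynchronous SGD.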

## Key findings

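- Transformer training for neural machine translation reaches a near-convergence level 1.36x faster in single-node multi-GPU training.
- Summing several asynchronous updates, rather than applying them immediately, restores the convergence behavior of Transformers.
- The hybrid method has no impact on model quality.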

## Abstract

Asynchronous stochastic gradient descent (SGD) is attractive from a speed perspective because workers do not wait for synchronization. However, the Transformer model converges poorly with asynchronous SGD, resulting in substantially lower quality compared to synchronous SGD. To investigate why this is the case, we isolate differences between asynchronous and synchronous methods to investigate batch size and staleness effects. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this hybrid method, Transformer training for a neural machine translation task reaches a near-convergence level 1.36x faster in single-node multi-GPU training with no impact on model quality.
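
The core idea of summing several asynchronous updates before applying them can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `AccumulatingServer`, the parameter `updates_per_step`, and the use of plain SGD with an unscaled sum are all assumptions made for illustration, and the sketch is single-threaded, so it shows only the accumulation logic and not real concurrency.

```python
from typing import List


class AccumulatingServer:
    """Toy parameter store that sums incoming gradients before updating.

    Hypothetical illustration only: names, plain SGD, and the unscaled sum
    are assumptions, not details taken from the paper.
    """

    def __init__(self, params: List[float], lr: float, updates_per_step: int):
        if updates_per_step < 1:
            raise ValueError("updates_per_step must be at least 1")
        self.params = list(params)
        self.lr = lr
        self.updates_per_step = updates_per_step
        self._buffer = [0.0] * len(self.params)  # running sum of gradients
        self._pending = 0                        # gradients summed so far
        self.version = 0                         # number of applied updates

    def push(self, grad: List[float]) -> bool:
        """Add one worker's gradient; apply the sum once enough have arrived.

        Returns True if the parameters were updated by this call.
        """
        if len(grad) != len(self.params):
            raise ValueError("gradient size does not match parameter size")
        for i, g in enumerate(grad):
            self._buffer[i] += g
        self._pending += 1
        if self._pending < self.updates_per_step:
            return False  # keep accumulating; parameters stay unchanged
        # Apply the summed gradient in a single step, then reset the buffer.
        for i in range(len(self.params)):
            self.params[i] -= self.lr * self._buffer[i]
            self._buffer[i] = 0.0
        self._pending = 0
        self.version += 1
        return True


# Two workers push gradients; the parameters change only after the second push.
server = AccumulatingServer(params=[1.0, -2.0], lr=0.5, updates_per_step=2)
first_applied = server.push([1.0, 0.0])
second_applied = server.push([1.0, 2.0])
```

Setting `updates_per_step=1` in this sketch reduces it to applying every update immediately, which corresponds to the purely asynchronous behavior that the abstract reports as converging poorly for Transformers.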

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.03496/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/1906.03496/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1906.03496/full.md

---
Source: https://tomesphere.com/paper/1906.03496