Scaling Neural Machine Translation

Myle Ott; Sergey Edunov; David Grangier; Michael Auli

arXiv:1806.00187 · cs.CL · September 6, 2018

TL;DR

This paper demonstrates that using reduced precision and large batch training significantly accelerates neural machine translation training, achieving state-of-the-art results in a fraction of the usual time on large GPU clusters.

Contribution

It shows that reduced precision and large batch sizes, combined with careful tuning and implementation, make training of neural machine translation models substantially faster, cutting training time on large benchmark datasets from days to hours.
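One common way to reach a large effective batch size when per-device memory is limited is gradient accumulation: gradients from several small micro-batches are combined before a single parameter update. This summary does not say whether the authors use that mechanism, so the following is only a minimal sketch of the general idea. The toy one-parameter model and the names `grad_mse` and `accumulated_grad` are hypothetical.

```python
# Toy illustration (not the paper's code): accumulating gradients over
# micro-batches yields the same gradient as one large batch.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x, averaged over the batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Combine micro-batch gradients, weighted by size, into one large-batch gradient."""
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Undo the per-micro-batch averaging so every example counts equally.
        total += grad_mse(w, mx, my) * len(mx)
    return total / len(xs)


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [2.0 * x for x in xs]  # synthetic targets generated with w = 2
w = 0.0

full = grad_mse(w, xs, ys)                                # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)   # four micro-batches of 2
print(full, accum)
```

Because the two gradients agree, the optimizer step is the same as with the large batch, while only one micro-batch has to fit in memory at a time.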

Findings

01

Speedup of nearly 5x in training time on a single 8-GPU machine.

02

Achieved state-of-the-art BLEU scores on WMT'14 English-German (29.3, or 29.8 with additional Paracrawl training data) and English-French (43.2).

03

Reduced training time from days to hours while maintaining or improving accuracy.

Abstract

Sequence-to-sequence learning models still require several days to reach state-of-the-art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT'14 English-German translation, we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs and we obtain a new state of the art of 29.3 BLEU after training for 85 minutes on 128 GPUs. We further improve these results to 29.8 BLEU by training on the much larger Paracrawl dataset. On the WMT'14 English-French task, we obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs.
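Reduced-precision training usually means half-precision (FP16) arithmetic, where very small gradient values can underflow to zero. A widely used remedy is loss scaling: values are multiplied by a constant before being stored in FP16 and divided by it again in higher precision. The abstract does not describe the authors' exact recipe, so the sketch below only illustrates the numerical effect. The scale factor and the helper `to_fp16` are assumptions made for this example.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0  # hypothetical scale factor, chosen only for illustration
true_grad = 1e-8     # smaller than anything FP16 can represent

# Without scaling, the gradient is lost when stored in FP16.
naive = to_fp16(true_grad)

# With scaling, the value survives FP16 storage and is unscaled in higher precision.
scaled = to_fp16(true_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(naive, recovered)
```

The unscaled value underflows to zero, while the scaled one is recovered with only a small rounding error.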
