TL;DR
This paper demonstrates that using reduced precision and large batch training significantly accelerates neural machine translation training, achieving state-of-the-art results in a fraction of the usual time on large GPU clusters.
Contribution
It shows how to train neural machine translation models faster using reduced precision and large batch sizes, cutting training time on large datasets from days to hours.
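To illustrate why reduced precision needs care, here is a minimal sketch of loss scaling, one common way to keep small gradients representable in half precision. This summary does not spell out the paper's exact mechanism, so the gradient value, the scale factor, and the helper `to_fp16` are all assumptions made for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical gradient value, chosen to lie below the smallest FP16 subnormal.
grad = 1e-8
# Assumed scale factor (a power of two, so scaling itself adds no rounding error).
loss_scale = 1024.0

# Cast directly to FP16: the value underflows to zero and the update is lost.
naive = to_fp16(grad)

# Scale first, cast to FP16, then unscale in full precision.
scaled = to_fp16(grad * loss_scale)
recovered = scaled / loss_scale

print("direct FP16 cast:", naive)
print("with loss scaling:", recovered)
```

The direct cast loses the gradient entirely, while the scaled path recovers it to within FP16 rounding error.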
Findings
Speedup of nearly 5x in training time on a single 8-GPU machine.
Achieved state-of-the-art BLEU scores on WMT'14 English-German (29.3, rising to 29.8 with additional Paracrawl training data) and English-French (43.2).
Reduced training time from days to hours while maintaining or improving accuracy.
Abstract
Sequence to sequence learning models still require several days to reach state-of-the-art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT'14 English-German translation, we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs and we obtain a new state of the art of 29.3 BLEU after training for 85 minutes on 128 GPUs. We further improve these results to 29.8 BLEU by training on the much larger Paracrawl dataset. On the WMT'14 English-French task, we obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs.
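One common way to reach large effective batch sizes on a fixed number of GPUs is to accumulate gradients over several smaller batches before each update. The sketch below is a toy illustration under that assumption, not the paper's implementation: the data, the one-parameter model, and the helper `grad_sum` are hypothetical. It checks that the accumulated gradient equals the gradient of one large batch.

```python
import math

def grad_sum(w, batch):
    """Sum of per-example gradients of 0.5 * (w * x - y)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch)

# Hypothetical toy data following y = 3x, and an arbitrary starting weight.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.5

# Gradient of the mean loss over one large batch of 8 examples.
large_batch_grad = grad_sum(w, data) / len(data)

# The same gradient accumulated over 4 micro-batches of 2 examples each.
accumulated = 0.0
for i in range(0, len(data), 2):
    accumulated += grad_sum(w, data[i:i + 2])
accumulated_grad = accumulated / len(data)

print(large_batch_grad, accumulated_grad)
print(math.isclose(large_batch_grad, accumulated_grad))
```

Because the per-example gradients are simply summed, splitting the batch changes only when the work is done, not the resulting update.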
