Training Tips for the Transformer Model

Martin Popel, Ondřej Bojar

arXiv:1804.00247·cs.CL·May 3, 2018

TL;DR

This paper offers practical tips for training Transformer models in neural machine translation, covering batch size, learning rate, warmup steps, and multi-GPU scaling, with the aim of improving translation quality and training efficiency.

Contribution

It offers empirically-backed recommendations for training Transformer models effectively, addressing hardware constraints and parameter tuning strategies.

Findings

01

Batch size and learning rate both affect final translation quality, so tuning them for the available hardware pays off.

02

Scaling to multiple GPUs enhances training efficiency without loss of accuracy.

03

The paper gives practical guidelines for checkpoint averaging and for choosing the maximum sentence length.

Abstract

This article describes our experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model (Vaswani et al., 2017). We examine some of the critical parameters that affect the final translation quality, memory usage, training stability and training time, concluding each experiment with a set of recommendations for fellow researchers. In addition to confirming the general mantra "more data and larger models", we address scaling to multiple GPUs and provide practical tips for improved training regarding batch size, learning rate, warmup steps, maximum sentence length and checkpoint averaging. We hope that our observations will allow others to get better results given their particular hardware and data constraints.
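Two of the knobs named in the abstract, warmup steps and checkpoint averaging, can be illustrated with a minimal sketch. The schedule below is the inverse-square-root rule with linear warmup from Vaswani et al. (2017); the function names, the `scale` argument, and the representation of a checkpoint as a dictionary of float lists are assumptions made for illustration and are not the Tensor2Tensor API.

```python
def noam_learning_rate(step, d_model=512, warmup_steps=4000, scale=1.0):
    """Learning rate with linear warmup followed by inverse-sqrt decay.

    Defaults follow the base Transformer of Vaswani et al. (2017);
    `scale` is a hypothetical extra multiplier for experimentation.
    """
    step = max(step, 1)  # guard against step 0, where step ** -0.5 is undefined
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def average_checkpoints(checkpoints):
    """Element-wise mean of several checkpoints.

    Each checkpoint is assumed to be a dict mapping a parameter name
    to a flat list of floats; real frameworks store tensors instead.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # zip(*...) groups the i-th value of this parameter across checkpoints
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


if __name__ == "__main__":
    for s in (1000, 4000, 16000):
        print(s, noam_learning_rate(s))
    print(average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```

In this sketch the learning rate rises linearly until `warmup_steps` and decays afterwards, so a longer warmup lowers the peak rate; averaging the last few saved checkpoints is a cheap way to smooth out fluctuations between individual checkpoints.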

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax