On Optimal Transformer Depth for Low-Resource Language Translation
Elan van Biljon, Arnu Pretorius, Julia Kreutzer

TL;DR
This paper investigates the optimal depth of Transformer models for low-resource language translation, finding that shallower models often outperform deeper ones in this setting while also reducing computational cost.
Contribution
The study reveals that low-resource NMT benefits from shallower Transformer models, challenging the trend of using very large models and promoting resource-efficient approaches.
Findings
Smaller Transformer models perform better in low-resource NMT.
Large models are more difficult to optimize and can hurt performance.
Resource-efficient models are effective for community-driven low-resource translation.
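To make the resource argument concrete, the following is a minimal sketch of how the parameter count of an encoder-decoder Transformer grows with depth. It uses the standard layer layout (multi-head attention, position-wise feed-forward, layer normalization). The function name, the default sizes (`d_model=512`, `d_ff=2048`), the vocabulary size and the depths compared are illustrative assumptions, not the configurations evaluated in the paper.

```python
def transformer_param_count(num_layers, d_model=512, d_ff=2048, vocab_size=4000):
    """Approximate parameter count for an encoder-decoder Transformer.

    Assumes num_layers layers in both encoder and decoder and a single
    shared embedding matrix. All default sizes are illustrative only.
    """
    # Q, K, V and output projections, each with a bias vector.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear maps (d_model -> d_ff -> d_model) with biases.
    feed_forward = 2 * d_model * d_ff + d_ff + d_model
    # Scale and shift vectors of one layer normalization.
    layer_norm = 2 * d_model

    # Encoder layer: self-attention + feed-forward, two layer norms.
    encoder_layer = attention + feed_forward + 2 * layer_norm
    # Decoder layer: self-attention + cross-attention + feed-forward, three layer norms.
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

    embeddings = vocab_size * d_model
    return num_layers * (encoder_layer + decoder_layer) + embeddings


if __name__ == "__main__":
    # Hypothetical depths chosen only to show the trend.
    for depth in (1, 2, 3, 6):
        print(f"{depth} layers: {transformer_param_count(depth):,} parameters")
```

Under these assumptions the layer stack grows linearly with depth, so halving the number of layers removes roughly half of the non-embedding parameters. This says nothing about translation quality by itself; it only illustrates why a shallower model is cheaper to train and store.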
Abstract
Transformers have shown great promise as an approach to Neural Machine Translation (NMT) for low-resource languages. At the same time, Transformer models remain difficult to optimize and require careful tuning of hyper-parameters to be useful in this setting. Many NMT toolkits come with a set of default hyper-parameters, which researchers and practitioners often adopt for convenience and to avoid tuning. These configurations, however, have been optimized for large-scale machine translation data sets with several million parallel sentences for European languages like English and French. In this work, we find that the current trend in the field to use very large models is detrimental for low-resource languages, since it makes training more difficult and hurts overall performance, confirming previous observations. We see our work as complementary to the Masakhane…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
