On Optimal Transformer Depth for Low-Resource Language Translation

Elan van Biljon; Arnu Pretorius; Julia Kreutzer

arXiv:2004.04418·cs.CL·April 16, 2020·AfricaNLP·20 cites

TL;DR

This paper investigates the optimal depth of Transformer models for low-resource language translation and shows that smaller models often outperform larger ones in this setting, which also reduces computational cost.

Contribution

The study reveals that low-resource NMT benefits from shallower Transformer models, challenging the trend of using very large models and promoting resource-efficient approaches.
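
To give a rough sense of why depth matters for resource efficiency, the sketch below estimates how the parameter count of a standard encoder-decoder Transformer grows with the number of layers. It is an illustration under stated assumptions, not the paper's configuration: the function name `transformer_param_count` is hypothetical, the defaults `d_model=512` and `d_ff=2048` are the usual "base" Transformer sizes, `vocab_size=32000` is an arbitrary placeholder, a single shared embedding table is assumed, and any final layer norms are ignored.

```python
def transformer_param_count(num_layers, d_model=512, d_ff=2048, vocab_size=32000):
    """Approximate trainable parameters of an encoder-decoder Transformer.

    Assumes the same number of layers in the encoder and the decoder,
    biases on every linear layer, and one embedding table shared between
    source, target, and output projection (all illustrative assumptions).
    """
    # Multi-head attention: Q, K, V and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward: d_model -> d_ff -> d_model, with biases.
    feed_forward = 2 * d_model * d_ff + d_ff + d_model
    # Layer normalization: one gain and one bias vector.
    layer_norm = 2 * d_model

    # Encoder layer: self-attention + feed-forward, each followed by a layer norm.
    encoder_layer = attention + feed_forward + 2 * layer_norm
    # Decoder layer: self-attention + cross-attention + feed-forward, three layer norms.
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

    embeddings = vocab_size * d_model
    return num_layers * (encoder_layer + decoder_layer) + embeddings


if __name__ == "__main__":
    # Depths chosen only for illustration.
    for depth in (1, 2, 3, 6):
        millions = transformer_param_count(depth) / 1e6
        print(f"{depth} layer(s) per side: about {millions:.1f}M parameters")
```

Under these assumptions the count grows linearly with depth, so each layer removed from both the encoder and the decoder saves a fixed number of parameters while the embedding cost stays constant.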

Findings

01. Smaller Transformer models perform better in low-resource NMT.

02. Large models are more difficult to optimize and can hurt performance.

03. Resource-efficient models are effective for community-driven low-resource translation.

Abstract

Transformers have shown great promise as an approach to Neural Machine Translation (NMT) for low-resource languages. However, at the same time, transformer models remain difficult to optimize and require careful tuning of hyper-parameters to be useful in this setting. Many NMT toolkits come with a set of default hyper-parameters, which researchers and practitioners often adopt for the sake of convenience and avoiding tuning. These configurations, however, have been optimized for large-scale machine translation data sets with several millions of parallel sentences for European languages like English and French. In this work, we find that the current trend in the field to use very large models is detrimental for low-resource languages, since it makes training more difficult and hurts overall performance, confirming previous observations. We see our work as complementary to the Masakhane…

Code & Models

Repositories

ElanVB/optimal_transformer_depth (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax