Optimizing Transformer for Low-Resource Neural Machine Translation

Ali Araabi; Christof Monz

arXiv:2011.02266 · cs.CL · November 5, 2020

TL;DR

This paper investigates how to optimize Transformer models for low-resource neural machine translation, demonstrating that hyper-parameter tuning significantly enhances translation quality in data-scarce scenarios.

Contribution

It provides an analysis of Transformer performance under low-resource conditions and proposes optimized hyper-parameter settings to improve translation quality.

Findings

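01. An optimized Transformer improves translation quality by up to 7.3 BLEU points over the Transformer default settings.

02. Hyper-parameter tuning is crucial for low-resource NMT.

03. Across subsets of the IWSLT14 training data, Transformer performance under low-resource conditions depends heavily on the hyper-parameter configuration.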

Abstract

Language pairs with limited amounts of parallel data, also known as low-resource languages, remain a challenge for neural machine translation. While the Transformer model has achieved significant improvements for many language pairs and has become the de facto mainstream architecture, its capability under low-resource conditions has not been fully investigated yet. Our experiments on different subsets of the IWSLT14 training data show that the effectiveness of Transformer under low-resource conditions is highly dependent on the hyper-parameter settings. Our experiments show that using an optimized Transformer for low-resource conditions improves the translation quality by up to 7.3 BLEU points compared to using the Transformer default settings.
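The abstract describes comparing hyper-parameter settings, which amounts to a search over configurations. A minimal Python sketch of such a search follows. The search space, the function names, and the scoring function are all illustrative assumptions, not the settings or procedure reported in the paper. In practice the score would come from training a model with each configuration and measuring BLEU on a development set.

```python
from itertools import product

# Hypothetical search space: names and values are illustrative
# assumptions, not the settings reported in the paper.
SEARCH_SPACE = {
    "layers": [2, 4, 6],
    "attention_heads": [2, 4],
    "dropout": [0.1, 0.3],
}


def enumerate_configs(space):
    """Yield one dict per combination of hyper-parameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def select_best(configs, score_fn):
    """Return the (config, score) pair with the highest score."""
    scored = [(config, score_fn(config)) for config in configs]
    return max(scored, key=lambda pair: pair[1])


def placeholder_score(config):
    """Stand-in for 'train a model, then measure dev-set BLEU'.

    This is an arbitrary toy function, not experimental data.
    """
    return -abs(config["layers"] - 4) - abs(config["dropout"] - 0.3)


configs = list(enumerate_configs(SEARCH_SPACE))
best_config, best_score = select_best(configs, placeholder_score)
```

An exhaustive grid is chosen here only because it is the simplest scheme to show. The same `select_best` helper would work unchanged with a randomly sampled list of configurations.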

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Adam · Dropout · Attention Is All You Need · Multi-Head Attention · Layer Normalization · Byte Pair Encoding