Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution

Toan Q. Nguyen; Kenton Murray; David Chiang

arXiv:2105.01691·cs.CL·July 5, 2021

TL;DR

This paper investigates why concatenation, a simple data augmentation method, improves low-resource neural machine translation. It finds that context diversity, length diversity, and (to a lesser extent) position shifting drive the gains, not discourse context.

Contribution

The study identifies three factors behind concatenation's effectiveness (context diversity, length diversity, and position shifting) and challenges the assumption that discourse context is the main contributor.

Findings

01

Concatenation improves translation by about +1 BLEU across four language pairs.

02

Discourse context is unlikely to be the main cause of the improvement.

03

The improvement instead comes from context diversity, length diversity, and (to a lesser extent) position shifting.

Abstract

In this paper, we investigate the driving factors behind concatenation, a simple but effective data augmentation method for low-resource neural machine translation. Our experiments suggest that discourse context is unlikely the cause for the improvement of about +1 BLEU across four language pairs. Instead, we demonstrate that the improvement comes from three other factors unrelated to discourse: context diversity, length diversity, and (to a lesser extent) position shifting.
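
To make the idea concrete, here is a minimal sketch of concatenation-based augmentation for a parallel corpus. This page does not describe the exact procedure, so the details are assumptions for illustration only: the function name `augment_by_concatenation`, the choice of a random partner pair, the single-space separator, and the toy corpus are all hypothetical and not taken from the paper.

```python
import random


def augment_by_concatenation(pairs, seed=0, sep=" "):
    """Return the original pairs plus one concatenated pair per original.

    Assumption (not from the paper): each sentence pair is joined with a
    randomly chosen partner pair, source to source and target to target.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    augmented = list(pairs)    # keep every original pair unchanged
    for src, tgt in pairs:
        # The partner may occasionally be the same pair; fine for a sketch.
        other_src, other_tgt = rng.choice(pairs)
        augmented.append((src + sep + other_src, tgt + sep + other_tgt))
    return augmented


# Toy English-German corpus, invented purely for illustration.
corpus = [
    ("the cat sleeps", "die Katze schläft"),
    ("I like tea", "ich mag Tee"),
    ("it is raining", "es regnet"),
]

augmented = augment_by_concatenation(corpus)
for src, tgt in augmented[len(corpus):]:
    print(src, "|||", tgt)
```

Under these assumptions, the sketch touches all three factors named in the abstract: a random partner places each sentence next to unrelated text (context diversity), the new examples are longer than the originals (length diversity), and the appended sentence starts at a later position than it would on its own (position shifting).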
