Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution
Toan Q. Nguyen, Kenton Murray, David Chiang

TL;DR
This paper investigates why concatenation, a simple data augmentation method, improves low-resource neural machine translation. It finds that context diversity, length diversity, and (to a lesser extent) position shifting drive the gains, rather than discourse context.
Contribution
The study identifies the true factors behind concatenation's effectiveness, challenging the assumption that discourse context is the main contributor.
Findings
Concatenation improves translation quality by about 1 BLEU across four language pairs.
Discourse context is unlikely to be the main cause of the improvement.
Three factors unrelated to discourse account for the gain: context diversity, length diversity, and, to a lesser extent, position shifting.
Abstract
In this paper, we investigate the driving factors behind concatenation, a simple but effective data augmentation method for low-resource neural machine translation. Our experiments suggest that discourse context is unlikely the cause for the improvement of about +1 BLEU across four language pairs. Instead, we demonstrate that the improvement comes from three other factors unrelated to discourse: context diversity, length diversity, and (to a lesser extent) position shifting.
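To make the method concrete, here is a minimal sketch of concatenation-based augmentation. It is not the authors' implementation. It assumes a parallel corpus held as a list of (source, target) string pairs, and the names `concat_augment`, `seed`, and `sep` are hypothetical. Pairing each sentence with a randomly chosen partner is also an assumption made for illustration; the summary above does not say how partners are selected.

```python
import random


def concat_augment(pairs, seed=0, sep=" "):
    """Return the original pairs plus one concatenated pair per original.

    Assumption: each sentence pair is joined with a randomly chosen
    partner from the same corpus. Source sides and target sides are
    concatenated in the same order so the new pair stays parallel.
    """
    rng = random.Random(seed)
    augmented = list(pairs)  # keep every original pair
    for src, tgt in pairs:
        other_src, other_tgt = rng.choice(pairs)
        augmented.append((src + sep + other_src, tgt + sep + other_tgt))
    return augmented


# Toy corpus, invented for illustration only.
corpus = [
    ("guten Morgen", "good morning"),
    ("wie geht es dir", "how are you"),
    ("danke", "thank you"),
]
augmented = concat_augment(corpus)
```

In this sketch the augmented corpus is twice the size of the original. Each added pair is longer than its first component, and the partner sentence begins at a later position than it would on its own, which loosely corresponds to the length diversity and position shifting factors named in the abstract.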
