Quantity vs. Quality of Monolingual Source Data in Automatic Text Translation: Can It Be Too Little If It Is Too Good?

Idris Abdulmumin, Bashir Shehu Galadanci, Garba Aliyu, and Shamsuddeen Hassan Muhammad

arXiv:2410.13783·cs.CL·October 18, 2024

TL;DR

This paper investigates the impact of monolingual data quantity and quality on low-resource neural machine translation, finding that selecting high-quality, domain-relevant data can outperform using all available data.

Contribution

It demonstrates that in low-resource NMT, quality-based data selection can be more effective than simply increasing data quantity.

Findings

01

High-quality, domain-relevant monolingual data improves translation performance.

02

Using all available monolingual data can be less effective than using a selected subset of it.

03

Selective data based on quality can outperform larger, unfiltered datasets.

Abstract

Monolingual data, being readily available in large quantities, has been used to augment scarce parallel data and train better models for automatic translation. Self-learning, where a model learns from its own output, is one approach to exploiting such data. However, it has been shown that too much of this data can be detrimental to the model's performance when the available parallel data is comparatively very small. In this study, we investigate whether the monolingual data can also be too little, and whether reducing it based on quality has any effect on the performance of the translation model. Experiments show that on English-German low-resource NMT, it is often better to select only the most useful additional data, based on quality or closeness to the domain of the test data, than to use all of the available data.
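The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_top_fraction`, the tuple layout, and the use of a single numeric quality score per synthetic sentence pair (for example, a model confidence score, where higher means better) are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

# Hypothetical layout: (source sentence, model translation, quality score).
ScoredPair = Tuple[str, str, float]


def select_top_fraction(
    scored_pairs: List[ScoredPair], fraction: float
) -> List[Tuple[str, str]]:
    """Keep only the highest-scoring fraction of synthetic sentence pairs.

    The score is assumed to be "higher is better"; how it is computed
    (model confidence, domain similarity, ...) is left open here.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Rank from best to worst; sorted() is stable, so ties keep input order.
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    keep = math.ceil(len(ranked) * fraction)
    # Drop the score: the selected pairs are what would be added to training.
    return [(src, hyp) for src, hyp, _ in ranked[:keep]]


if __name__ == "__main__":
    pairs = [
        ("a", "A", -0.9),
        ("b", "B", -0.1),
        ("c", "C", -0.5),
        ("d", "D", -2.0),
    ]
    print(select_top_fraction(pairs, 0.5))
```

Setting `fraction=1.0` corresponds to using all of the available synthetic data, so the two settings the abstract compares differ only in this one parameter.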
