How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation
Yan Meng, Di Wu, Christof Monz

TL;DR
This paper introduces self-correction, a training approach for machine translation that uses the model's increasingly reliable token-level self-knowledge to handle misaligned, web-mined parallel data and improve translation quality.
Contribution
It proposes a novel self-correction approach that enhances robustness to real-world data noise in machine translation systems.
Findings
Self-correction improves translation accuracy in noisy conditions.
The method outperforms traditional noise filtering techniques.
The method is effective on both simulated and real-world noisy datasets.
Abstract
The massive amounts of web-mined parallel data contain large amounts of noise. Semantic misalignment, as the primary source of the noise, poses a challenge for training machine translation systems. In this paper, we first introduce a process for simulating misalignment controlled by semantic similarity, which closely resembles misaligned sentences in real-world web-crawled corpora. Under our simulated misalignment noise settings, we quantitatively analyze its impact on machine translation and demonstrate the limited effectiveness of widely used pre-filters for noise detection. This underscores the necessity of more fine-grained ways to handle hard-to-detect misalignment noise. With an observation of the increasing reliability of the model's self-knowledge for distinguishing misaligned and clean data at the token level, we propose self-correction, an approach that gradually increases…
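To make the idea of similarity-controlled misalignment concrete, here is a minimal Python sketch. It is not the authors' procedure: the abstract does not specify the similarity model or the sampling scheme, so the bag-of-words cosine (`bow_cosine`), the function `simulate_misalignment`, and the parameters `noise_ratio`, `sim_low`, and `sim_high` are all assumptions made for illustration. The sketch swaps a pair's target for another target whose similarity to the original falls inside a chosen band, so that a higher band yields harder-to-detect noise.

```python
import math
import random
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Toy stand-in for a semantic similarity score: cosine over word counts.

    Assumption: a real setup would use a sentence-embedding model instead.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def simulate_misalignment(pairs, noise_ratio, sim_low, sim_high, seed=0):
    """Return (source, target, is_noisy) triples.

    A fraction `noise_ratio` of pairs is selected; each selected pair has its
    target replaced by a different target whose similarity to the original
    target lies in [sim_low, sim_high). Pairs with no such candidate are left
    clean. All names and thresholds here are hypothetical.
    """
    rng = random.Random(seed)
    n_noisy = round(noise_ratio * len(pairs))
    chosen = set(rng.sample(range(len(pairs)), n_noisy))
    result = []
    for i, (src, tgt) in enumerate(pairs):
        if i in chosen:
            candidates = [
                other_tgt
                for j, (_, other_tgt) in enumerate(pairs)
                if j != i and sim_low <= bow_cosine(tgt, other_tgt) < sim_high
            ]
            if candidates:
                result.append((src, rng.choice(candidates), True))
                continue
        result.append((src, tgt, False))
    return result
```

Calling `simulate_misalignment(pairs, noise_ratio=0.3, sim_low=0.5, sim_high=1.0)` on a list of `(source, target)` tuples would corrupt roughly 30% of the pairs with near-miss targets, while a lower band would produce easier, more obviously wrong pairs.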
Taxonomy
Topics: Natural Language Processing Techniques
