How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation

Yan Meng; Di Wu; Christof Monz

arXiv:2407.02208 · cs.CL · February 10, 2025

TL;DR

This paper introduces a self-correction method for machine translation that leverages the model's increasing self-knowledge to handle noisy, misaligned web-mined data, significantly improving translation quality.

Contribution

The paper proposes self-correction, a novel approach that makes machine translation training more robust to real-world data noise.

Findings

01. Self-correction improves translation accuracy in noisy conditions.

02. The method outperforms traditional noise filtering techniques.

03. Self-correction is effective on both simulated and real-world noisy datasets.

Abstract

The massive amounts of web-mined parallel data contain large amounts of noise. Semantic misalignment, as the primary source of the noise, poses a challenge for training machine translation systems. In this paper, we first introduce a process for simulating misalignment controlled by semantic similarity, which closely resembles misaligned sentences in real-world web-crawled corpora. Under our simulated misalignment noise settings, we quantitatively analyze its impact on machine translation and demonstrate the limited effectiveness of widely used pre-filters for noise detection. This underscores the necessity of more fine-grained ways to handle hard-to-detect misalignment noise. With an observation of the increasing reliability of the model's self-knowledge for distinguishing misaligned and clean data at the token level, we propose self-correction, an approach that gradually increases…
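
To make the idea of similarity-controlled misalignment concrete, here is a minimal sketch of one way such noise could be simulated. It is not the authors' procedure: the function name `simulate_misalignment`, the similarity band `sim_low`/`sim_high`, the `noise_ratio` parameter, and the use of precomputed target-sentence embeddings are all assumptions made for illustration. The sketch replaces the target side of a sampled pair with the target of another pair whose cosine similarity falls inside the chosen band.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def simulate_misalignment(pairs, embeddings, noise_ratio, sim_low, sim_high, seed=0):
    """Return a copy of `pairs` in which a fraction of targets is swapped.

    pairs       -- list of (source, target) sentence strings
    embeddings  -- one vector per target sentence (hypothetical input; any
                   sentence encoder could supply these)
    noise_ratio -- fraction of pairs to corrupt (assumed parameter)
    sim_low, sim_high -- similarity band for the replacement target (assumed)
    """
    rng = random.Random(seed)
    n = len(pairs)
    chosen = rng.sample(range(n), int(noise_ratio * n))
    noisy = list(pairs)
    for i in chosen:
        # Candidate replacements: other targets whose similarity to the
        # original target lies inside the requested band.
        candidates = [
            j for j in range(n)
            if j != i and sim_low <= cosine(embeddings[i], embeddings[j]) <= sim_high
        ]
        if candidates:  # leave the pair clean if nothing falls in the band
            j = rng.choice(candidates)
            noisy[i] = (pairs[i][0], pairs[j][1])
    return noisy


pairs = [("a", "A"), ("b", "B"), ("c", "C")]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
noisy = simulate_misalignment(pairs, embeddings, noise_ratio=1.0, sim_low=0.5, sim_high=1.0)
```

In this sketch, a higher similarity band would yield misaligned pairs that look more plausible and are therefore harder for a pre-filter to catch, while a lower band yields more obvious mismatches.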

Videos

How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation · Underline

Taxonomy

Topics: Natural Language Processing Techniques