Lack of Fluency is Hurting Your Translation Model

Jaehyo Yoo; Jaewoo Kang

arXiv:2205.11826 · cs.CL · May 25, 2022

TL;DR

This paper argues that a fluency gap between training and test data harms translation model performance, and proposes a method to detect and remove 'fluency noise' from training sentences to improve translation quality.

Contribution

The work introduces a gradient-based method to detect and eliminate fluency noise in training data, enhancing translation models' performance on benchmark datasets.

Findings

01. Removing fluency noise improves translation accuracy on WMT-14 datasets.

02. The method is compatible with back-translation augmentation.

03. Qualitative analysis reveals key points affecting fluency noise detection.

Abstract

Many machine translation models are trained on bilingual corpora, which consist of aligned sentence pairs from two different languages with the same meaning. However, there is a qualitative discrepancy between the train and test sets of a bilingual corpus. While most train sentences are created via automatic techniques such as crawling and sentence alignment, the test sentences are annotated by humans with fluency in mind. We hypothesize that this discrepancy in the training corpus degrades the performance of translation models. In this work, we define fluency noise to determine which parts of train sentences cause them to seem unnatural. We show that fluency noise can be detected by a simple gradient-based method with a pre-trained classifier. By removing fluency noise from train sentences, our final model outperforms the baseline on WMT-14 DE → EN and…
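The abstract describes detecting fluency noise with a gradient-based method on a pre-trained classifier, but this page does not give the details. The sketch below is therefore only an illustration of one common variant, gradient-times-input saliency, and is not the authors' implementation. The toy embeddings, the classifier weights, the zero threshold, and the function names (`fluency_probability`, `token_saliency`, `remove_fluency_noise`) are all hypothetical.

```python
import math

# Hypothetical 2-d token embeddings and classifier weights, for illustration only.
# A real setup would use a pre-trained fluency classifier instead.
EMBEDDINGS = {
    "the": [0.5, 0.1],
    "cat": [0.6, 0.2],
    "sat": [0.4, 0.3],
    "click": [-0.9, 0.1],
    "here": [-0.7, 0.0],
}
WEIGHTS = [2.0, 0.5]
BIAS = 0.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fluency_probability(tokens):
    """P(fluent) from a logistic classifier over the mean token embedding."""
    n = len(tokens)
    mean = [sum(EMBEDDINGS[t][d] for t in tokens) / n for d in range(len(WEIGHTS))]
    return 1.0 / (1.0 + math.exp(-(dot(WEIGHTS, mean) + BIAS)))


def token_saliency(tokens):
    """Gradient-times-input of the loss -log P(fluent) for each token.

    The gradient of the loss w.r.t. embedding e_i is -(1 - p) * w / n, so the
    score for token i is -(1 - p) * (w . e_i) / n. Positive scores mark tokens
    that push the sentence towards the 'not fluent' class.
    """
    p = fluency_probability(tokens)
    n = len(tokens)
    return [-(1.0 - p) * dot(WEIGHTS, EMBEDDINGS[t]) / n for t in tokens]


def remove_fluency_noise(tokens, threshold=0.0):
    """Drop tokens whose saliency exceeds the (assumed) threshold."""
    scores = token_saliency(tokens)
    return [t for t, s in zip(tokens, scores) if s <= threshold]


sentence = ["the", "cat", "sat", "click", "here"]
cleaned = remove_fluency_noise(sentence)
print(cleaned)
```

In this toy setting the crawl-style tokens "click" and "here" receive positive saliency and are dropped, which raises the classifier's fluency probability for the remaining sentence. How the paper chooses the threshold, and whether it removes tokens or larger spans, is not stated on this page.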

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification