Lack of Fluency is Hurting Your Translation Model
Jaehyo Yoo, Jaewoo Kang

TL;DR
This paper argues that a fluency mismatch between training and test data harms translation model performance, and proposes detecting and removing 'fluency noise' from training sentences to improve translation quality.
Contribution
The work introduces a gradient-based method to detect and eliminate fluency noise in training data, enhancing translation models' performance on benchmark datasets.
Findings
Removing fluency noise improves translation accuracy on WMT-14 datasets.
The method is compatible with back-translation augmentation.
Qualitative analysis reveals key points affecting fluency noise detection.
Abstract
Many machine translation models are trained on bilingual corpora, which consist of aligned sentence pairs in two languages with the same meaning. However, there is a qualitative discrepancy between the training and test sets of these corpora. While most training sentences are created via automatic techniques such as crawling and sentence alignment, the test sentences are annotated by humans with fluency in mind. We hypothesize that this discrepancy degrades the performance of translation models. In this work, we define fluency noise to identify which parts of training sentences make them seem unnatural. We show that fluency noise can be detected by a simple gradient-based method with a pre-trained classifier. By removing fluency noise from training sentences, our final model outperforms the baseline on WMT-14 DE→EN and…
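To make the idea of gradient-based detection concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy logistic classifier over mean-pooled token embeddings and scores each token by gradient-times-input with respect to the "unnatural" logit. The embeddings, weights, threshold, and all function names are hypothetical and chosen by hand for illustration; the summary above does not specify the paper's classifier or scoring rule.

```python
import math

# Hypothetical 2-d token embeddings and classifier weights, set by hand
# for illustration only; a real setup would use a pre-trained classifier.
EMBEDDINGS = {
    "the": [0.1, 0.0],
    "cat": [0.0, 0.2],
    "sat": [0.1, 0.1],
    "click": [2.0, -0.5],
    "here": [1.5, -0.2],
}
WEIGHTS = [1.0, -1.0]  # direction the toy classifier treats as "unnatural"
BIAS = -0.5


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unnatural_probability(tokens):
    """P(unnatural | tokens) for a logistic model over mean-pooled embeddings."""
    n = len(tokens)
    pooled = [sum(EMBEDDINGS[t][d] for t in tokens) / n
              for d in range(len(WEIGHTS))]
    logit = sum(w * x for w, x in zip(WEIGHTS, pooled)) + BIAS
    return sigmoid(logit)


def token_saliency(tokens):
    """Gradient-times-input score of each token for the 'unnatural' logit.

    With mean pooling, d(logit)/d(e_i) = WEIGHTS / n, so the score of
    token i is dot(WEIGHTS, e_i) / n.
    """
    n = len(tokens)
    return [sum(w * x for w, x in zip(WEIGHTS, EMBEDDINGS[t])) / n
            for t in tokens]


def remove_fluency_noise(tokens, threshold=0.2):
    """Drop tokens whose saliency toward 'unnatural' exceeds the threshold."""
    scores = token_saliency(tokens)
    return [t for t, s in zip(tokens, scores) if s <= threshold]


sentence = ["the", "cat", "sat", "click", "here"]
cleaned = remove_fluency_noise(sentence)
print(cleaned)
```

In this toy setup the crawl-style tokens "click" and "here" receive the highest scores and are dropped, which lowers the classifier's "unnatural" probability for the remaining sentence. With a neural classifier, the closed-form gradient would be replaced by automatic differentiation, and the threshold would need tuning on held-out data.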
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
