An Empirical Study of Automatic Post-Editing
Xu Zhang, Xiaojun Wan

TL;DR
This study investigates how data augmentation methods and data domain affect automatic post-editing (APE) model performance, revealing key factors for improving APE systems and analyzing their limitations on complex translation outputs.
Contribution
The paper provides a comprehensive analysis of artificial data construction, domain relevance, and the challenges faced by state-of-the-art APE models on difficult translation cases.
Findings
High-quality artificial corpora enhance APE performance
In-domain data improves results, while out-of-domain data can hinder them
Models struggle with long source texts and high-quality MT outputs
Abstract
Automatic post-editing (APE) aims to reduce manual post-editing efforts by automatically correcting errors in machine-translated output. Due to the limited amount of human-annotated training data, data scarcity is one of the main challenges faced by all APE systems. To alleviate the lack of genuine training data, most current APE systems employ data augmentation methods to generate large-scale artificial corpora. In view of the importance of data augmentation in APE, we separately study the impact of the construction method of artificial corpora and of the artificial data domain on the performance of APE models. Moreover, the difficulty of APE varies between different machine translation (MT) systems. We study the outputs of the state-of-the-art APE model on a difficult APE dataset to analyze the problems in existing APE systems. Primarily, we find that 1) Artificial corpora with…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
