An Empirical Study of Automatic Post-Editing

Xu Zhang; Xiaojun Wan

arXiv:2209.07759·cs.CL·September 19, 2022

Open Access

TL;DR

This study investigates how data augmentation methods and data domain affect automatic post-editing (APE) model performance, revealing key factors for improving APE systems and analyzing their limitations on complex translation outputs.

Contribution

The paper provides a comprehensive analysis of artificial data construction, domain relevance, and the challenges faced by state-of-the-art APE models on difficult translation cases.

Findings

01 High-quality artificial corpora enhance APE performance

02 In-domain data improves results; out-of-domain data can hinder them

03 Models struggle with long source texts and high-quality MT outputs

Abstract

Automatic post-editing (APE) aims to reduce manual post-editing efforts by automatically correcting errors in machine-translated output. Due to the limited amount of human-annotated training data, data scarcity is one of the main challenges faced by all APE systems. To alleviate the lack of genuine training data, most current APE systems employ data augmentation methods to generate large-scale artificial corpora. In view of the importance of data augmentation in APE, we separately study the impact of the construction method of artificial corpora and of the artificial data domain on the performance of APE models. Moreover, the difficulty of APE varies between different machine translation (MT) systems. We study the outputs of the state-of-the-art APE model on a difficult APE dataset to analyze the problems in existing APE systems. Primarily, we find that 1) Artificial corpora with…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research