eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing

Matteo Negri; Marco Turchi; Rajen Chatterjee; Nicola Bertoldi

arXiv:1803.07274 · cs.CL · March 21, 2018 · 37 citations

TL;DR

eSCAPE is a large-scale synthetic corpus for automatic post-editing (APE) of machine translation. It provides millions of artificially generated (source, MT, post-edit) triplets, and models trained on it yield statistically significant improvements in MT quality.

Contribution

This paper introduces eSCAPE, the largest freely available synthetic corpus for automatic post-editing. Each triplet is created by machine-translating a source sentence and using the corresponding target side as an artificial human post-edit.
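The construction described above can be illustrated with a minimal sketch. It assumes a parallel corpus given as (source, target) pairs and an arbitrary `translate` callable standing in for an MT system; the names `Triplet`, `build_synthetic_triplets`, and `toy_translate` are hypothetical and are not taken from the paper or its released data.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    source: str     # original source sentence
    mt: str         # machine translation of the source
    post_edit: str  # target side, used as an artificial post-edit


def build_synthetic_triplets(
    parallel_corpus: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[Triplet]:
    """Pair each source sentence with its MT output and its target side."""
    triplets = []
    for source, target in parallel_corpus:
        # The target is NOT a real human post-edit of the MT output;
        # it only approximates one (assumption of the synthetic setup).
        triplets.append(Triplet(source, translate(source), target))
    return triplets


def toy_translate(sentence: str) -> str:
    """Hypothetical stand-in for an MT system: a one-entry lookup table."""
    lookup = {"the house is small": "das Haus ist kleine"}
    return lookup.get(sentence, sentence)


corpus = [("the house is small", "das Haus ist klein")]
triplets = build_synthetic_triplets(corpus, toy_translate)
```

In this toy case the single resulting triplet pairs the erroneous MT output with the target sentence, which is the kind of (error, correction) example an APE model would learn from.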

Findings

01

Models trained on eSCAPE yield statistically significant improvements in MT quality.

02

eSCAPE contains millions of triplets for English-German and English-Italian.

03

Artificial data enhances post-editing performance in general-domain scenarios.

Abstract

Training models for the automatic correction of machine-translated text usually relies on data consisting of (source, MT, human post-edit) triplets providing, for each source sentence, examples of translation errors with the corresponding corrections made by a human post-editor. Ideally, a large amount of data of this kind should allow the model to learn reliable correction patterns and effectively apply them at test stage on unseen (source, MT) pairs. In practice, however, their limited availability calls for solutions that also integrate in the training process other sources of knowledge. Along this direction, state-of-the-art results have been recently achieved by systems that, in addition to a limited amount of available training data, exploit artificial corpora that approximate elements of the "gold" training instances with automatic translations. Following this idea, we present…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification