eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing
Matteo Negri, Marco Turchi, Rajen Chatterjee, Nicola Bertoldi

TL;DR
eSCAPE is a large-scale synthetic corpus designed to improve automatic post-editing of machine translation by providing millions of artificially generated triplets, leading to statistically significant quality improvements.
Contribution
This paper introduces eSCAPE, the largest freely-available synthetic corpus for automatic post-editing, created by machine-translating the source side of parallel texts and using the target side as an artificial human post-edit.
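The construction described above can be illustrated with a minimal sketch. It assumes a parallel corpus given as (source, target) pairs and an arbitrary translation function. The names `Triplet`, `build_synthetic_triplets`, and `toy_translate` are hypothetical and do not come from the paper, and the word-by-word lookup is only a stand-in for a real MT system.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    source: str     # original source sentence
    mt: str         # machine translation of the source
    post_edit: str  # reference target, standing in for a human post-edit


def build_synthetic_triplets(
    parallel_corpus: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[Triplet]:
    """Turn (source, target) pairs into (source, MT, artificial post-edit) triplets."""
    return [Triplet(src, translate(src), tgt) for src, tgt in parallel_corpus]


def toy_translate(sentence: str) -> str:
    # Hypothetical stand-in for an MT system: a word-by-word dictionary lookup.
    toy_lexicon = {"the": "das", "house": "Haus", "is": "ist", "small": "klein"}
    return " ".join(toy_lexicon.get(word.lower(), word) for word in sentence.split())


corpus = [("The house is small", "Das Haus ist klein")]
triplets = build_synthetic_triplets(corpus, toy_translate)
```

In this sketch the reference translation plays the role of the correction, so any difference between `mt` and `post_edit` is treated as an error to be fixed by the post-editing model.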
Findings
Models trained on eSCAPE improve MT quality with statistically significant gains.
eSCAPE contains millions of triplets for English-German and English-Italian.
Artificial data enhances post-editing performance in general-domain scenarios.
Abstract
Training models for the automatic correction of machine-translated text usually relies on data consisting of (source, MT, human post-edit) triplets providing, for each source sentence, examples of translation errors with the corresponding corrections made by a human post-editor. Ideally, a large amount of data of this kind should allow the model to learn reliable correction patterns and effectively apply them at test stage on unseen (source, MT) pairs. In practice, however, the limited availability of such data calls for solutions that also integrate other sources of knowledge into the training process. Along this direction, state-of-the-art results have recently been achieved by systems that, in addition to a limited amount of available training data, exploit artificial corpora that approximate elements of the "gold" training instances with automatic translations. Following this idea, we present…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
