Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms
Wonkee Lee, Seong-Hwan Heo, Jong-Hyeok Lee

TL;DR
This paper introduces a novel data-synthesis method for automatic post-editing that uses masked language models to generate error-rich synthetic data, improving model performance.
Contribution
The paper proposes a noising-based data-synthesis technique, combined with selective corpus interleaving, to produce high-quality synthetic training data for APE models.
Findings
Training on the proposed synthetic data significantly improves APE performance.
The proposed method outperforms existing synthetic data generation approaches.
Selective corpus interleaving enhances data quality and model results.
Abstract
Semi-supervised learning that leverages synthetic data for training has been widely adopted for developing automatic post-editing (APE) models due to the lack of training data. With this aim, we focus on data-synthesis methods to create high-quality synthetic data. Given that APE takes as input a machine-translation result that might include errors, we present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data. We introduce a noising-based data-synthesis method by adapting the masked language model approach, generating a noisy text from a clean text by infilling masked tokens with erroneous tokens. Moreover, we propose selective corpus interleaving that combines two separate synthetic datasets by taking only the advantageous samples to enhance the quality of the synthetic data further. Experimental results show that using the…
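The noising step described in the abstract (mask some tokens of a clean text, then infill them with erroneous tokens) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `CONFUSIONS` table is a hypothetical stand-in for a masked language model that proposes erroneous replacements, and the function names, the `mask_prob` parameter, and the `src`/`mt`/`pe` triplet layout are assumptions made for illustration. The sketch also assumes that the clean text plays the role of the post-edit and the noisy text the role of the machine-translation output.

```python
import random

# Hypothetical confusion table. It stands in for a masked language model
# that proposes erroneous replacements; it is NOT the model from the paper.
CONFUSIONS = {
    "house": ["home", "building"],
    "big": ["large", "great"],
    "is": ["are", "was"],
}


def choose_mask_positions(tokens, mask_prob, rng):
    """Return one boolean per token: True means the position is masked."""
    return [rng.random() < mask_prob for _ in tokens]


def infill_with_errors(tokens, masked, rng):
    """Fill every masked position with a token that differs from the original."""
    noisy = []
    for token, is_masked in zip(tokens, masked):
        if not is_masked:
            noisy.append(token)
            continue
        candidates = [c for c in CONFUSIONS.get(token, []) if c != token]
        # Keep the original token when no erroneous candidate is known.
        noisy.append(rng.choice(candidates) if candidates else token)
    return noisy


def synthesize_triplet(source, clean_target, mask_prob=0.3, seed=0):
    """Build one synthetic (src, mt, pe) triplet from a clean sentence pair."""
    rng = random.Random(seed)
    tokens = clean_target.split()
    masked = choose_mask_positions(tokens, mask_prob, rng)
    noisy = infill_with_errors(tokens, masked, rng)
    return {"src": source, "mt": " ".join(noisy), "pe": clean_target}


if __name__ == "__main__":
    print(synthesize_triplet("das Haus ist groß", "the house is big", mask_prob=0.5))
```

In this toy version, the masking rate controls how many errors the pseudo machine translation contains. A real system would replace the lookup table with a trained model's predictions at the masked positions.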
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
