Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation

Wenxiang Jiao; Xing Wang; Shilin He; Irwin King; Michael R. Lyu; Zhaopeng Tu

arXiv:2010.02552 · cs.CL · October 7, 2020 · 6 cites

Open Access · 1 Repo

TL;DR

This paper introduces a data rejuvenation method that identifies and re-labels inactive training examples in large-scale NMT datasets, leading to improved model performance and training stability.

Contribution

It proposes a novel framework to exploit inactive examples through re-labeling, enhancing neural machine translation training on large datasets.

Findings

01. Significant performance improvements on WMT14 datasets.

02. Enhanced training stability and faster convergence.

03. Better generalization of final NMT models.

Abstract

Large-scale training datasets lie at the core of the recent success of neural machine translation (NMT) models. However, the complex patterns and potential noise in the large-scale data make training NMT models difficult. In this work, we explore how to identify inactive training examples, which contribute less to model performance, and show that the existence of inactive examples depends on the data distribution. We further introduce data rejuvenation to improve the training of NMT models on large-scale datasets by exploiting inactive examples. The proposed framework consists of three phases. First, we train an identification model on the original training data, and use it to distinguish inactive examples from active examples by their sentence-level output probabilities. Then, we train a rejuvenation model on the active examples, which is used to re-label the inactive examples with…
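The first phase described in the abstract, separating inactive from active examples by sentence-level output probability, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sentence_score` and `split_active_inactive`, the length-normalized log-probability score, and the fixed `inactive_fraction` cutoff are all assumptions made for illustration. The per-token log-probabilities are assumed to come from an already trained identification model.

```python
from typing import List, Sequence, Tuple


def sentence_score(token_logprobs: Sequence[float]) -> float:
    """Mean per-token log-probability of one target sentence.

    Length normalization is an assumed scoring choice, not taken from the abstract.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return sum(token_logprobs) / len(token_logprobs)


def split_active_inactive(
    logprobs_per_example: Sequence[Sequence[float]],
    inactive_fraction: float = 0.1,  # hypothetical cutoff, chosen for illustration
) -> Tuple[List[int], List[int]]:
    """Return (active_indices, inactive_indices).

    The lowest-scoring fraction of examples is treated as inactive.
    """
    if not 0.0 <= inactive_fraction <= 1.0:
        raise ValueError("inactive_fraction must be in [0, 1]")
    scores = [sentence_score(lp) for lp in logprobs_per_example]
    # Rank example indices from least to most probable under the model.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_inactive = int(len(scores) * inactive_fraction)
    inactive = sorted(order[:n_inactive])
    active = sorted(order[n_inactive:])
    return active, inactive


if __name__ == "__main__":
    # Toy per-token log-probabilities for four training examples.
    toy = [[-0.1, -0.2], [-2.0, -3.0], [-0.5], [-1.5, -1.5, -1.5]]
    active, inactive = split_active_inactive(toy, inactive_fraction=0.5)
    print("active:", active)
    print("inactive:", inactive)
```

Under these assumptions, the active indices would select the data used to train the rejuvenation model, and the inactive indices would select the examples to be re-labeled.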

Code & Models

Repositories

wxjiao/Data-Rejuvenation (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications