MAD for Robust Reinforcement Learning in Machine Translation
Domenic Donato, Lei Yu, Wang Ling, Chris Dyer

TL;DR
This paper presents MAD, a distributed policy gradient algorithm for machine translation that improves training stability and generalization through two variance reduction strategies: conditional reward normalization and a robust importance weighting scheme based on the mean absolute deviation.
Contribution
The paper introduces MAD, a novel distributed policy gradient method with variance reduction techniques that outperforms existing reward-aware training methods (REINFORCE, MRT, and PPO) in machine translation.
Findings
MAD outperforms REINFORCE, MRT, and PPO in stability and generalization.
Policies trained with MAD perform well with greedy and beam search decoding.
The learned policies are sensitive to the reward functions used during training.
Abstract
We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting calculation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends crucially on two variance reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive and negative reward translation examples and (2) a new robust importance weighting scheme that acts as a conditional entropy regularizer. Experiments on a…
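The first variance reduction strategy in the abstract, conditional reward normalization, can be illustrated with a minimal sketch. It assumes the normalization is done by subtracting the per-source mean reward from each sampled candidate, which is one simple way to give every source sentence both positive and negative reward examples; the paper's exact normalization may differ. The sketch also computes the mean absolute deviation, the statistic the algorithm is named after. How that statistic enters the importance weights is not specified in the text above, so it is not shown. The function names `center_rewards_per_source` and `mean_absolute_deviation` are hypothetical.

```python
from statistics import fmean


def center_rewards_per_source(rewards):
    """Subtract the mean reward of one source sentence's candidates.

    Assumption: mean-centering stands in for the paper's conditional
    reward normalization. If the candidates do not all share the same
    reward, the result contains both positive and negative values.
    """
    mean = fmean(rewards)
    return [r - mean for r in rewards]


def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    values = list(values)
    center = fmean(values)
    return fmean([abs(v - center) for v in values])


# Hypothetical rewards for three sampled translations of one source sentence.
candidate_rewards = [1.0, 2.0, 6.0]

centered = center_rewards_per_source(candidate_rewards)
spread = mean_absolute_deviation(candidate_rewards)

print(centered)  # [-2.0, -1.0, 3.0]
print(spread)    # 2.0
```

If every candidate for a source sentence receives the same reward, the centered rewards are all zero and that sentence contributes no learning signal under this sketch.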
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: REINFORCE
