Learning to Generalize from Sparse and Underspecified Rewards
Rishabh Agarwal, Chen Liang, Dale Schuurmans, Mohammad Norouzi

TL;DR
This paper introduces Meta Reward Learning (MeRL), a method that constructs auxiliary reward functions to improve learning from sparse, underspecified success-failure feedback, achieving state-of-the-art results in weakly-supervised semantic parsing.
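To make "underspecified" concrete, here is a minimal toy sketch, not taken from the paper. The table, the question, and both programs are hypothetical illustrations. It shows how a binary success-failure reward can give the same score to a program that answers the question as intended and to one that reaches the right answer by accident.

```python
# Hypothetical question: "How many gold medals did the nation ranked 2nd win?"
# All names and data below are illustrative assumptions, not from the paper.

def binary_reward(prediction, gold_answer):
    """Success-failure feedback: 1.0 on an exact match, 0.0 otherwise."""
    return 1.0 if prediction == gold_answer else 0.0

def intended_program(table):
    """Looks up the gold-medal count of the row whose rank is 2."""
    return next(row["gold"] for row in table if row["rank"] == 2)

def spurious_program(table):
    """Counts rows with at least two gold medals (unrelated to the question)."""
    return sum(1 for row in table if row["gold"] >= 2)

train_table = [
    {"rank": 1, "nation": "A", "gold": 3},
    {"rank": 2, "nation": "B", "gold": 2},
    {"rank": 3, "nation": "C", "gold": 1},
]
test_table = [
    {"rank": 1, "nation": "A", "gold": 5},
    {"rank": 2, "nation": "B", "gold": 4},
    {"rank": 3, "nation": "C", "gold": 3},
]

# On the training table both programs earn the same reward.
train_rewards = (
    binary_reward(intended_program(train_table), 2),
    binary_reward(spurious_program(train_table), 2),
)

# On a different table only the intended program still succeeds.
test_rewards = (
    binary_reward(intended_program(test_table), 4),
    binary_reward(spurious_program(test_table), 4),
)

print("train rewards (intended, spurious):", train_rewards)
print("test rewards (intended, spurious):", test_rewards)
```

Because the training reward cannot separate the two programs, a policy that reinforces every rewarded trajectory equally may learn the spurious one. That is the generalization problem a more refined auxiliary reward is meant to address.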
Contribution
The paper proposes MeRL, a novel approach that optimizes auxiliary reward functions to enhance generalization and exploration in sparse reward settings, outperforming Bayesian Optimization-based methods.
Findings
MeRL outperforms Bayesian Optimization in reward learning.
Achieves state-of-the-art results in weakly-supervised semantic parsing on WikiTableQuestions and WikiSQL.
Improves on previous work by 1.2% on WikiTableQuestions and 2.4% on WikiSQL.
Abstract
We consider the problem of learning from sparse and underspecified rewards, where an agent receives a complex input, such as a natural language instruction, and needs to generate a complex response, such as an action sequence, while only receiving binary success-failure feedback. Such success-failure rewards are often underspecified: they do not distinguish between purposeful and accidental success. Generalization from underspecified rewards hinges on discounting spurious trajectories that attain accidental success, while learning from sparse feedback requires effective exploration. We address exploration by using a mode covering direction of KL divergence to collect a diverse set of successful trajectories, followed by a mode seeking KL divergence to train a robust policy. We propose Meta Reward Learning (MeRL) to construct an auxiliary reward function that provides more refined…
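The two KL directions mentioned in the abstract behave differently, which a small numerical sketch can illustrate. This is not the paper's training code. The distributions are made-up assumptions, with `target` standing in for a distribution spread over two successful modes. The forward direction KL(target || q) is mode covering and favors a broad approximation, while the reverse direction KL(q || target) is mode seeking and favors one that concentrates on a single mode.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over four outcomes (assumed values).
target = [0.49, 0.01, 0.01, 0.49]   # two well-separated modes
broad = [0.25, 0.25, 0.25, 0.25]    # spreads mass over everything
peaked = [0.97, 0.01, 0.01, 0.01]   # commits to one mode only

# Mode covering: KL(target || q) heavily penalizes q for missing a mode.
forward_broad = kl_divergence(target, broad)
forward_peaked = kl_divergence(target, peaked)

# Mode seeking: KL(q || target) heavily penalizes q for placing mass
# where the target has almost none.
reverse_broad = kl_divergence(broad, target)
reverse_peaked = kl_divergence(peaked, target)

print(f"forward KL: broad={forward_broad:.3f}, peaked={forward_peaked:.3f}")
print(f"reverse KL: broad={reverse_broad:.3f}, peaked={reverse_peaked:.3f}")
```

In this toy setting the forward direction ranks the broad distribution as the better fit and the reverse direction ranks the peaked one as better, matching the intuition of using the former to gather diverse successful trajectories and the latter to train a policy that commits to a subset of them.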
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
Methods: Model-Agnostic Meta-Learning · Meta Reward Learning
