Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, Rosalind Picard

TL;DR
This paper introduces a novel off-policy batch deep reinforcement learning algorithm tailored for learning from fixed human interaction data in dialog systems, enabling effective offline training without exploration.
Contribution
The paper presents a new class of off-policy batch RL algorithms that leverage pre-trained models, KL-control, and dropout-based uncertainty to improve learning from human interaction data.
Findings
Effective learning from fixed human interaction data in dialog generation
Ability to extract multiple reward functions post-hoc
Significant improvements over prior methods in real-world deployment
Abstract
Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem…
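The two mechanisms named in the abstract, KL-control against a pre-trained prior and a dropout-based lower bound on target Q-values, can be illustrated with a small sketch. This is not the authors' implementation: the function names (kl_penalized_reward, dropout_lower_bound_target, make_toy_q), the kl_weight parameter, the exact form of the penalty, and the toy linear Q-function are all assumptions made for illustration.

```python
import math
import random


def kl_penalized_reward(reward, log_pi, log_prior, kl_weight=0.1):
    """Subtract a sampled KL penalty from the task reward.

    Assumed form: r - c * (log pi(a|s) - log p(a|s)), where p is the
    pre-trained prior. Actions the policy favors far more than the
    prior does are penalized.
    """
    return reward - kl_weight * (log_pi - log_prior)


def make_toy_q(weights, hidden, drop_prob=0.5):
    """Hypothetical stand-in for a target Q-network with dropout.

    Each call draws a fresh dropout mask over the hidden features and
    returns one Q-value per action.
    """
    def q_next_fn(rng):
        keep = [
            0.0 if rng.random() < drop_prob else 1.0 / (1.0 - drop_prob)
            for _ in hidden
        ]
        return [
            sum(w * h * k for w, h, k in zip(row, hidden, keep))
            for row in weights
        ]
    return q_next_fn


def dropout_lower_bound_target(q_next_fn, reward, gamma=0.99,
                               n_samples=10, rng=None):
    """Bellman target using the minimum over stochastic forward passes.

    For each action, take the smallest Q-value seen across n_samples
    dropout masks, then take the max over actions of those minima.
    """
    if rng is None:
        rng = random.Random(0)
    samples = [q_next_fn(rng) for _ in range(n_samples)]
    n_actions = len(samples[0])
    lower = [min(s[a] for s in samples) for a in range(n_actions)]
    return reward + gamma * max(lower)


# Toy usage: two actions, two hidden features.
toy_q = make_toy_q(weights=[[1.0, 0.0], [0.0, 1.0]], hidden=[2.0, 3.0])
target = dropout_lower_bound_target(toy_q, reward=0.5, gamma=0.9)
shaped = kl_penalized_reward(1.0, math.log(0.5), math.log(0.25))
```

Taking the minimum over several dropout masks of a single network gives a pessimistic target without maintaining a second Q-network, which is the efficiency argument the abstract makes relative to Double Q-Learning.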
Taxonomy
Topics: Reinforcement Learning in Robotics · Speech and dialogue systems · Multimodal Machine Learning Applications
Methods: Q-Learning
