Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

Natasha Jaques; Asma Ghandeharioun; Judy Hanwen Shen; Craig Ferguson; Agata Lapedriza; Noah Jones; Shixiang Gu; Rosalind Picard

arXiv:1907.00456·cs.LG·July 9, 2019·131 cites

TL;DR

This paper introduces a class of off-policy batch deep reinforcement learning algorithms that learn offline, without exploration, from a fixed batch of human interaction data collected by a dialog system.

Contribution

The paper presents a new class of off-policy batch RL algorithms that use a pre-trained model as a prior, apply KL-control to penalize divergence from that prior, and use dropout-based uncertainty estimates to lower bound the target Q-values.

Findings

01 Effective learning from fixed human interaction data in dialog generation

02 Ability to extract multiple reward functions post-hoc

03 Significant improvements over prior methods in real-world deployment

Abstract

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem…
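The two mechanisms named in the abstract, KL-control against a pre-trained prior and a dropout-based lower bound on the target Q-value, can be illustrated with a small sketch. This is not the authors' implementation. The function names, the `kl_weight` parameter, and the exact form of the penalty are assumptions made for illustration, and the choice to take the minimum over dropout passes per action before the greedy maximum is one plausible reading of "lower bound the target Q-values". The Q-values are passed in as plain lists, standing in for several forward passes of a target network with dropout enabled.

```python
import math


def kl_penalized_reward(reward, logp_policy, logp_prior, kl_weight=0.1):
    """Task reward minus a penalty for drifting away from the prior.

    logp_policy and logp_prior are log-probabilities of the chosen action
    under the learned policy and the pre-trained prior. Averaged over actions
    sampled from the policy, their difference estimates KL(policy || prior).
    kl_weight is a hypothetical trade-off coefficient.
    """
    return reward - kl_weight * (logp_policy - logp_prior)


def dropout_lower_bound_target(q_samples, reward, gamma=0.99):
    """Bellman target built from a pessimistic Q estimate.

    q_samples is a list of M lists. Each inner list holds the Q-value of
    every next action from one stochastic forward pass (one dropout mask).
    """
    num_actions = len(q_samples[0])
    # Assumed lower bound: for each action, the minimum over the M passes.
    lower_bound = [min(qs[a] for qs in q_samples) for a in range(num_actions)]
    # Greedy backup over the pessimistic values.
    return reward + gamma * max(lower_bound)


if __name__ == "__main__":
    # Three dropout passes over two candidate next actions (made-up numbers).
    passes = [[1.0, 2.0], [0.5, 3.0], [1.5, 1.0]]
    print(dropout_lower_bound_target(passes, reward=0.5, gamma=0.9))
    print(kl_penalized_reward(1.0, math.log(0.5), math.log(0.25)))
```

In this sketch an action keeps a high target only if every dropout pass agrees that it is valuable, which is how the minimum discourages overestimating actions the batch data says little about. The penalty term lowers the reward whenever the policy assigns an action more probability than the prior does.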

Code & Models

Repositories

natashamjaques/neural_chat (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Speech and dialogue systems · Multimodal Machine Learning Applications

Methods: Q-Learning