Building Persona Consistent Dialogue Agents with Offline Reinforcement Learning
Ryan Shea, Zhou Yu

TL;DR
This paper introduces an offline reinforcement learning framework, together with a variance-reducing importance sampling method, that improves persona consistency and dialogue quality in open-domain chatbots at a lower training cost than online RL.
Contribution
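It presents a novel offline RL approach that combines inexpensive training on existing data, as in supervised learning, with rewards and penalties for specific utterances, as in RL. It also introduces VaRMI, an importance sampling method that reduces the variance of importance weights to improve training stability.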
Findings
Improved persona consistency in dialogue agents.
Enhanced dialogue quality according to automatic and human evaluations.
Reduced training costs compared to online RL methods.
Abstract
Maintaining a consistent persona is a key quality for any open domain dialogue system. Current state-of-the-art systems do this by training agents with supervised learning or online reinforcement learning (RL). However, systems trained with supervised learning often lack consistency as they are never punished for uttering contradictions. Additional training with RL can alleviate some of these issues, however the training process is expensive. Instead, we propose an offline RL framework to improve the persona consistency of dialogue systems. Our framework allows us to combine the advantages of previous methods as we can inexpensively train our model on existing data as in supervised learning, while punishing and rewarding specific utterances as in RL. We also introduce a simple importance sampling method to reduce the variance of importance weights in offline RL training which we call…
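To make the idea of importance-weighted offline training concrete, here is a minimal sketch of a generic off-policy policy-gradient surrogate computed over logged utterances. It is not the paper's VaRMI method, which the truncated abstract does not define. The function names, the per-utterance log-probability inputs, and the weight clipping used to limit variance are all assumptions made for illustration.

```python
import math
import statistics


def importance_weights(logp_policy, logp_behavior, clip=None):
    """Per-utterance importance weights pi_theta(a|s) / pi_b(a|s).

    Inputs are log-probabilities of each logged utterance under the current
    policy and under the (assumed known) behavior policy that produced the data.
    `clip` is an optional upper bound, a common generic heuristic for limiting
    weight variance (an assumption here, not the paper's technique).
    """
    weights = [math.exp(lp - lb) for lp, lb in zip(logp_policy, logp_behavior)]
    if clip is not None:
        weights = [min(w, clip) for w in weights]
    return weights


def weight_variance(weights):
    """Population variance of the importance weights."""
    return statistics.pvariance(weights)


def offline_pg_loss(logp_policy, logp_behavior, rewards, clip=None):
    """Importance-weighted policy-gradient surrogate, averaged over the batch.

    Each term is -w * r * log pi_theta(a|s). A positive reward pushes the
    utterance's log-probability up and a negative reward pushes it down.
    In an autograd framework the weight w would be treated as a constant
    so that gradients flow only through the log-probability.
    """
    weights = importance_weights(logp_policy, logp_behavior, clip)
    terms = [-w * r * lp for w, r, lp in zip(weights, rewards, logp_policy)]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    logp_policy = [-1.0, 0.0]      # hypothetical log-probs under the current policy
    logp_behavior = [-1.0, -3.0]   # hypothetical log-probs under the behavior policy
    rewards = [1.0, -1.0]          # +1 for a consistent utterance, -1 for a contradiction

    raw = importance_weights(logp_policy, logp_behavior)
    clipped = importance_weights(logp_policy, logp_behavior, clip=2.0)
    print("variance (raw):", weight_variance(raw))
    print("variance (clipped):", weight_variance(clipped))
    print("loss (clipped):", offline_pg_loss(logp_policy, logp_behavior, rewards, clip=2.0))
```

The sketch shows why weight variance matters: a single utterance that the current policy favors far more than the behavior policy did can dominate the batch, and bounding the weights trades some bias for a more stable update.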
Taxonomy
Topics: AI in Service Interactions · Topic Modeling · Persona Design and Applications
