TL;DR
This paper introduces VHRL, a hierarchical reinforcement learning method that improves open-domain dialog generation by optimizing long-term conversational rewards, leading to more human-like and appropriate interactions.
Contribution
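The paper presents a hierarchical RL framework that uses policy gradients to tune the utterance-level embedding of a variational sequence model, rather than applying RL at the word level. Acting at the utterance level is intended to ease credit assignment for long-term conversational rewards.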
Findings
Significant improvements in human evaluation over state-of-the-art dialog models.
Improved automatic metrics for dialog quality.
Outperforms state-of-the-art dialog models, including Transformer-based ones.
Abstract
Open-domain dialog generation is a challenging problem; maximum likelihood training can lead to repetitive outputs, models have difficulty tracking long-term conversational goals, and training on standard movie or online datasets may lead to the generation of inappropriate, biased, or offensive text. Reinforcement Learning (RL) is a powerful framework that could potentially address these issues, for example by allowing a dialog model to optimize for reducing toxicity and repetitiveness. However, previous approaches which apply RL to open-domain dialog generation do so at the word level, making it difficult for the model to learn proper credit assignment for long-term conversational rewards. In this paper, we propose a novel approach to hierarchical reinforcement learning, VHRL, which uses policy gradients to tune the utterance-level embedding of a variational sequence model. This…
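The idea named in the abstract, using policy gradients to tune an utterance-level embedding instead of individual words, can be illustrated with a toy REINFORCE loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the "embedding" is a context-free Gaussian mean `mu`, and `toy_reward`, `target`, and every hyperparameter are hypothetical stand-ins for a real dialog model and a real conversational reward.

```python
import numpy as np

def toy_reward(z, target):
    """Hypothetical utterance-level reward: higher when z is near `target`."""
    return -np.sum((z - target) ** 2)

def reinforce_utterance_latent(target, steps=500, batch=32,
                               sigma=0.5, lr=0.05, seed=0):
    """Tune the mean of a Gaussian 'utterance embedding' with REINFORCE.

    Assumption: the policy is z ~ N(mu, sigma^2 I) with fixed sigma and no
    dialog context. A real model would condition mu on the conversation.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros_like(target, dtype=float)
    for _ in range(steps):
        # Sample one latent per simulated utterance.
        z = mu + sigma * rng.standard_normal((batch, mu.size))
        rewards = np.array([toy_reward(zi, target) for zi in z])
        # Subtract the batch-mean reward as a simple variance-reducing baseline.
        advantages = rewards - rewards.mean()
        # Score function: d/dmu log N(z; mu, sigma^2 I) = (z - mu) / sigma^2.
        grad_log_prob = (z - mu) / sigma ** 2
        grad = (advantages[:, None] * grad_log_prob).mean(axis=0)
        mu = mu + lr * grad  # gradient ascent on expected reward
    return mu

if __name__ == "__main__":
    target = np.array([1.0, -2.0])
    print(reinforce_utterance_latent(target))
```

Because the reward is attached to one decision per utterance rather than to each generated word, the gradient signal does not have to be distributed across individual tokens, which is the credit-assignment motivation the abstract gives.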
