TL;DR
This paper introduces a novel offline reinforcement learning approach to train dialog models using human feedback, addressing exploration and overestimation challenges, and demonstrates improved conversational quality in real-world tests.
Contribution
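It develops a class of offline RL algorithms that use KL-control to penalize divergence from a pre-trained prior language model, together with a pessimistic strategy for estimating future reward under uncertainty, enabling effective training of dialog models from static human feedback datasets.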
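As a rough illustration of the KL-control idea, the sketch below scores a policy by its expected reward minus a weighted KL divergence from a prior distribution over the same actions. It is a toy over small discrete distributions, not the paper's implementation: the function names, the `kl_weight` parameter, and the example numbers are all hypothetical, and the prior is assumed to give nonzero probability to every action the policy can take.

```python
import math


def kl_divergence(policy, prior):
    """KL(policy || prior) for two discrete distributions given as lists.

    Assumes prior[i] > 0 wherever policy[i] > 0 (illustrative assumption).
    """
    return sum(p * math.log(p / q) for p, q in zip(policy, prior) if p > 0)


def kl_regularized_objective(policy, prior, rewards, kl_weight=0.1):
    """Expected reward minus a penalty for drifting away from the prior.

    kl_weight is a hypothetical knob: larger values keep the policy
    closer to the prior, smaller values let the reward dominate.
    """
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - kl_weight * kl_divergence(policy, prior)


# Toy example: two actions, with the prior treating them as equally likely.
prior = [0.5, 0.5]
rewards = [1.0, 0.0]
greedy_policy = [0.9, 0.1]

unchanged = kl_regularized_objective(prior, prior, rewards)
lightly_penalized = kl_regularized_objective(greedy_policy, prior, rewards, kl_weight=0.1)
heavily_penalized = kl_regularized_objective(greedy_policy, prior, rewards, kl_weight=10.0)
```

With a small `kl_weight` the reward-seeking policy scores higher than simply copying the prior, while a large `kl_weight` makes the same policy score lower, which is the trade-off the penalty is meant to control.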
Findings
Significant improvement in dialog quality over existing offline RL methods.
Effective use of implicit human feedback cues as reward signals.
Validated with ratings from 80 users in open-domain conversations.
Abstract
How can we train a dialog model to produce better conversations by learning from human feedback, without the risk of humans teaching it harmful chat behaviors? We start by hosting models online, and gather human feedback from real-time, open-ended conversations, which we then use to train and improve the models using offline reinforcement learning (RL). We identify implicit conversational cues including language similarity, elicitation of laughter, sentiment, and more, which indicate positive human feedback, and embed these in multiple reward functions. A well-known challenge is that learning an RL policy in an offline setting usually fails due to the lack of ability to explore and the tendency to make over-optimistic estimates of future reward. These problems become even harder when using RL for language models, which can easily have a 20,000 action vocabulary and many possible reward…
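To make the over-optimism problem more concrete, the sketch below contrasts two ways of forming a one-step value target from several noisy estimates of the next state's value. Taking the minimum of the estimates is one common way to be pessimistic; whether it matches the strategy used in the paper is not stated in the text above, so treat it as an assumption. All names and numbers are hypothetical.

```python
def averaged_target(reward, next_value_samples, discount=0.99):
    """One-step target using the mean of the sampled next-state values."""
    if not next_value_samples:
        raise ValueError("need at least one value estimate")
    mean_value = sum(next_value_samples) / len(next_value_samples)
    return reward + discount * mean_value


def pessimistic_target(reward, next_value_samples, discount=0.99):
    """One-step target using the lowest sampled next-state value.

    Using the minimum acts as a lower bound, so a single inflated
    estimate cannot raise the target (illustrative assumption).
    """
    if not next_value_samples:
        raise ValueError("need at least one value estimate")
    return reward + discount * min(next_value_samples)


# Toy example: one of the three estimates is badly inflated.
samples = [2.0, 3.0, 10.0]
avg = averaged_target(1.0, samples, discount=0.5)
low = pessimistic_target(1.0, samples, discount=0.5)
```

In this toy case the inflated estimate pulls the averaged target up, while the pessimistic target ignores it. In an offline setting the model cannot gather new data to correct such an overestimate, which is why a conservative target can help.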
