Multi-turn Reinforcement Learning from Preference Human Feedback

Lior Shani; Aviv Rosenberg; Asaf Cassel; Oran Lang; Daniele Calandriello; Avital Zipori; Hila Noga; Orgad Keller; Bilal Piot; Idan Szpektor; Avinatan Hassidim; Yossi Matias; Rémi Munos

arXiv:2405.14655·cs.LG·December 3, 2024·1 cites

TL;DR

This paper introduces novel multi-turn preference-based reinforcement learning methods for aligning large language models, demonstrating improved performance over existing single-turn approaches in dialogue and education environments.

Contribution

The paper develops a mirror-descent-based policy optimization algorithm for multi-turn preference-based RL and proves its convergence to Nash equilibrium in the tabular setting, addressing the limitations of single-turn preference methods.

Findings

01. The deep RL variant outperforms RLHF baselines in the Education Dialogue environment.

02. The algorithm recovers reward-based RL performance using only preference signals.

03. Convergence to Nash equilibrium is proven in the tabular setting.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to achieve a long-term goal. In this paper, we address this issue by developing novel methods for Reinforcement Learning (RL) from preference feedback between two full multi-turn conversations. In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to Nash equilibrium. To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student…
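
To make the idea of mirror descent on a preference game concrete, the sketch below runs entropic mirror descent (a multiplicative-weights update) in self-play on a single-state preference matrix. It is only an illustration under stated assumptions, not the paper's multi-turn algorithm: the matrix `P`, the step size `eta`, the step count, and all function names are hypothetical, and the setting is reduced to one decision with three actions. `P[a][b]` is assumed to be the probability that action `a` is preferred over action `b`, with `P[a][b] + P[b][a] = 1`.

```python
import math


def preference_against(P, policy):
    """q[a] = probability that action a is preferred over an action drawn from policy."""
    return [sum(p_ab * pi_b for p_ab, pi_b in zip(row, policy)) for row in P]


def mirror_descent_self_play(P, steps=200, eta=0.5):
    """Entropic mirror descent in self-play: pi(a) <- pi(a) * exp(eta * q(a)), renormalized."""
    n = len(P)
    policy = [1.0 / n] * n  # start from the uniform policy
    for _ in range(steps):
        q = preference_against(P, policy)
        weights = [pi_a * math.exp(eta * q_a) for pi_a, q_a in zip(policy, q)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return policy


def exploitability(P, policy):
    """Best-response win rate against policy, minus the game value of 0.5."""
    return max(preference_against(P, policy)) - 0.5


# Hypothetical preference matrix (an assumption for illustration):
# action 0 is preferred over every other action, so the pure policy on
# action 0 is the Nash equilibrium of this symmetric game.
P = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]

policy = mirror_descent_self_play(P)
print("policy:", policy)
print("exploitability:", exploitability(P, policy))
```

In this toy game the policy concentrates on action 0 and the exploitability shrinks toward zero, which is what convergence to a Nash equilibrium means here. In games without a single dominant action, plain multiplicative weights can cycle, which is one reason regularized variants are commonly used; the sketch does not attempt to reproduce any such variant.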

Code & Models

Repositories

google-research-datasets/Education-Dialogue-Dataset (Official)

Taxonomy

Topics: Evolutionary Algorithms and Applications · Reinforcement Learning in Robotics