Multi-turn Reinforcement Learning from Preference Human Feedback
Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos

TL;DR
This paper introduces novel multi-turn preference-based reinforcement learning methods for aligning large language models, demonstrating improved performance over existing single-turn approaches in dialogue and education environments.
Contribution
The paper develops a new mirror-descent-based policy optimization algorithm for multi-turn preference-based RL and proves its convergence, addressing limitations of single-turn preference methods.
Findings
A deep RL variant of the algorithm outperforms RLHF baselines in the Education Dialogue environment.
The algorithm recovers reward-based RL performance using only preference signals.
The algorithm is proven to converge to a Nash equilibrium in the tabular setting.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to achieve a long-term goal. In this paper, we address this issue by developing novel methods for Reinforcement Learning (RL) from preference feedback between two full multi-turn conversations. In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to Nash equilibrium. To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student…
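As a rough illustration of the mirror-descent idea mentioned in the abstract, the sketch below runs self-play exponentiated-gradient updates on a single-state preference game. It is not the paper's multi-turn algorithm: the single-state simplification, the function name `mirror_descent_self_play`, the step size `eta`, and the toy preference matrix `PREF` are all assumptions made for this example. Entry `PREF[a][b]` is the assumed probability that response `a` is preferred to response `b`.

```python
import math


def mirror_descent_self_play(pref, steps=200, eta=1.0):
    """Self-play mirror descent (KL geometry) on a single-state preference game.

    Hypothetical sketch: pref[a][b] is the probability that action a is
    preferred over action b, with pref[a][b] + pref[b][a] == 1.
    """
    n = len(pref)
    # Start from the uniform policy, stored in log space for stability.
    log_policy = [-math.log(n)] * n
    for _ in range(steps):
        policy = [math.exp(lp) for lp in log_policy]
        # Probability that each action beats a sample from the current policy.
        win_prob = [sum(pref[a][b] * policy[b] for b in range(n)) for a in range(n)]
        # Mirror-descent step with entropy regularizer: multiplicative update.
        log_policy = [lp + eta * w for lp, w in zip(log_policy, win_prob)]
        # Renormalize with a log-sum-exp so the policy sums to one.
        m = max(log_policy)
        log_norm = m + math.log(sum(math.exp(lp - m) for lp in log_policy))
        log_policy = [lp - log_norm for lp in log_policy]
    return [math.exp(lp) for lp in log_policy]


# Toy preference matrix (assumed): action 0 is preferred to every other action.
PREF = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]

final_policy = mirror_descent_self_play(PREF)
print(final_policy)
```

In this toy game action 0 beats every alternative with probability above one half, so the Nash equilibrium of the preference game puts all mass on it, and the self-play iterates concentrate there. Games without such a dominant action generally need regularization or iterate averaging for the updates to settle, which this sketch leaves out.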
Taxonomy
Topics: Evolutionary Algorithms and Applications · Reinforcement Learning in Robotics
