Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Zhaolin Gao; Wenhao Zhan; Jonathan D. Chang; Gokul Swamy; Kianté Brantley; Jason D. Lee; Wen Sun

arXiv:2410.04612 · cs.LG · April 25, 2025

TL;DR

This paper introduces REFUEL, an efficient policy optimization method for multi-turn RLHF in large language models, addressing covariate shift and improving performance on dialogue tasks.

Contribution

REFUEL frames multi-turn RLHF as a sequence of regression tasks, using a single model and self-generated data for policy optimization.
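
The regression framing can be illustrated with a minimal sketch. It assumes a REBEL-style squared loss in which the relative change in log-probability between two candidate responses is regressed onto the relative reward of their continuations. The function name `relative_regression_loss`, the scale `eta`, and all numeric values are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
def relative_regression_loss(logp_new_a, logp_old_a,
                             logp_new_b, logp_old_b,
                             reward_a, reward_b, eta=1.0):
    """Squared error between a scaled relative log-ratio and a relative reward.

    Assumed form (illustrative, REBEL-style):
        ((1/eta) * [(log pi_new(a) - log pi_old(a))
                    - (log pi_new(b) - log pi_old(b))]
         - (reward_a - reward_b)) ** 2
    where a and b are two responses sampled at the same dialogue state and
    each reward scores the conversation that follows that response.
    """
    rel_log_ratio = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    rel_reward = reward_a - reward_b
    return (rel_log_ratio / eta - rel_reward) ** 2


# Hypothetical values: the new policy raised the log-probability of response a
# by 0.5 and left response b unchanged, while a's continuation scored 0.5 higher.
loss_matched = relative_regression_loss(-1.0, -1.5, -2.0, -2.0, 1.0, 0.5)

# Hypothetical values: the policy is unchanged but the rewards differ by 2.0.
loss_unchanged = relative_regression_loss(-1.0, -1.0, -2.0, -2.0, 3.0, 1.0)
```

Under this assumed loss, the error is zero when the policy's relative log-probability shift (scaled by `1/eta`) equals the relative reward, so minimizing it pushes probability toward the response whose continuation scored higher.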

Findings

01

REFUEL outperforms state-of-the-art methods like DPO and REBEL.

02

Llama-3-8B-it fine-tuned with REFUEL outperforms the larger Llama-3.1-70B-it on long multi-turn dialogues.

03

Theoretically, REFUEL can match the performance of any policy covered by the training set.

Abstract

Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to…

Code & Models

Repositories

zhaolingao/refuel
PyTorch · Official

Videos

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF · SlidesLive

Taxonomy

Topics: Reliability and Maintenance Optimization

Methods: Direct Preference Optimization · Sparse Evolutionary Training