Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF
Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun

TL;DR
This paper introduces REFUEL, an efficient policy optimization method for multi-turn RLHF in large language models, addressing covariate shift and improving performance on dialogue tasks.
Contribution
REFUEL frames multi-turn RLHF as a sequence of regression tasks and uses a single model trained on self-generated data, which addresses the covariate shift described in the abstract.
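To make the regression framing concrete, here is a minimal sketch of a squared loss on relative future rewards. This page does not state the objective, so the form below is an assumption modeled on REBEL-style regression: the policy's scaled log-probability-ratio gap between two responses is regressed onto the gap in the rewards observed after rolling each response out to the end of the dialogue. The function name `relative_future_regression_loss`, its arguments, and the scale `eta` are all hypothetical.

```python
def relative_future_regression_loss(
    logp_new_a: float,
    logp_old_a: float,
    logp_new_b: float,
    logp_old_b: float,
    reward_a: float,
    reward_b: float,
    eta: float = 1.0,
) -> float:
    """Squared error between a scaled log-ratio gap and a future-reward gap.

    Assumed (not taken from the source): responses a and b are sampled at the
    same dialogue state, and reward_a / reward_b are the end-of-dialogue
    rewards obtained by continuing the conversation from each response.
    """
    # Log-probability ratio of each response under the new vs. the old policy.
    log_ratio_a = logp_new_a - logp_old_a
    log_ratio_b = logp_new_b - logp_old_b

    # Relative preference implied by the policy update, scaled by 1 / eta.
    predicted_gap = (log_ratio_a - log_ratio_b) / eta

    # Relative future reward observed from the two rollouts.
    observed_gap = reward_a - reward_b

    return (predicted_gap - observed_gap) ** 2
```

Under this assumed form, an unchanged policy pays the squared reward gap, and the loss reaches zero once the scaled log-ratio gap matches the reward gap.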
Findings
REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL.
Despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues.
Theoretically, REFUEL can match the performance of any policy covered by the training set.
Abstract
Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to…
Taxonomy
Topics: Reliability and Maintenance Optimization
Methods: Direct Preference Optimization · Sparse Evolutionary Training
