RLHF Workflow: From Reward Modeling to Online RLHF

Hanze Dong; Wei Xiong; Bo Pang; Haoxiang Wang; Han Zhao; Yingbo Zhou; Nan Jiang; Doyen Sahoo; Caiming Xiong; Tong Zhang

arXiv:2405.07863 · cs.LG · November 13, 2024 · 3 cites


TL;DR

This paper provides a detailed, reproducible workflow for online iterative RLHF using open-source datasets and models, demonstrating state-of-the-art performance on multiple benchmarks with accessible resources.

Contribution

It introduces a practical recipe for online RLHF using proxy preference models, filling a gap in open-source implementations and enabling resource-limited communities to achieve high performance.

Findings

01. Achieved state-of-the-art results on multiple LLM benchmarks.

02. Demonstrated effectiveness of supervised fine-tuning combined with RLHF.

03. Provided open-source models, datasets, and step-by-step guides for reproducibility.

Abstract

In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. We aim to fill this gap and provide a detailed, easy-to-reproduce recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models from a diverse set of open-source datasets and use the resulting proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed…
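The idea of using a proxy preference model in place of human annotators can be illustrated with a small sketch. It is not the authors' implementation: it assumes a scalar reward compared under a Bradley-Terry model and a best-versus-worst pairing rule, neither of which is stated in the abstract above. The names `bradley_terry_prob`, `build_preference_pairs`, `generate`, and `proxy_reward` are hypothetical and introduced only for illustration.

```python
import math
from typing import Callable, List, Tuple


def bradley_terry_prob(r_chosen: float, r_rejected: float) -> float:
    """Probability that the first response is preferred, assuming a
    Bradley-Terry model: the sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], str],             # hypothetical policy sampler
    proxy_reward: Callable[[str, str], float],  # hypothetical proxy scorer
    n_samples: int = 4,
) -> List[Tuple[str, str, str]]:
    """Collect one (prompt, chosen, rejected) triple per prompt.

    Assumed pairing rule: sample n_samples responses from the current
    policy, score each with the proxy reward, and keep the highest- and
    lowest-scoring responses as the chosen/rejected pair.
    """
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        # Sort ascending by proxy score; last is best, first is worst.
        ranked = sorted(candidates, key=lambda resp: proxy_reward(prompt, resp))
        pairs.append((prompt, ranked[-1], ranked[0]))
    return pairs
```

In an online iterative setting, pairs collected this way from the current policy would feed the next round of preference optimization, after which the updated policy generates the next batch of candidates.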

Code & Models

Datasets

hendrydong/preference_700K
dataset · 1.0k downloads

Taxonomy

Topics: Simulation Techniques and Applications · Advanced Control Systems Optimization

Methods: Sparse Evolutionary Training