RLHF Workflow: From Reward Modeling to Online RLHF
Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang

TL;DR
This paper provides a detailed, reproducible workflow for online iterative RLHF using open-source datasets and models, demonstrating state-of-the-art performance on multiple benchmarks with accessible resources.
Contribution
It introduces a practical recipe for online RLHF using proxy preference models, filling a gap in open-source implementations and enabling resource-limited communities to achieve high performance.
Findings
Achieved state-of-the-art results on multiple LLM benchmarks.
Demonstrated effectiveness of supervised fine-tuning combined with RLHF.
Provided open-source models, datasets, and step-by-step guides for reproducibility.
Abstract
In this technical report, we present the workflow of online iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. We aim to fill this gap and provide a detailed, easy-to-reproduce recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models from a diverse set of open-source datasets and use the resulting proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed…
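As a rough illustration of the idea in the abstract, where a proxy preference model stands in for human feedback during online data collection, the following Python sketch samples several responses per prompt, scores them with a proxy reward, and keeps a (chosen, rejected) pair. Every name here (`toy_policy`, `toy_proxy_reward`, `collect_preference_pairs`, `online_iterations`) is hypothetical, the toy reward is a placeholder for a trained model, and the best-versus-worst pairing rule is an assumption made for illustration rather than a detail stated above. The policy update between rounds is deliberately left out.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def toy_policy(prompt: str, rng: random.Random) -> str:
    """Hypothetical policy: emits a short random response."""
    words = ["helpful", "answer", "detail", "example", "ok"]
    k = rng.randint(1, 5)
    return " ".join(rng.choice(words) for _ in range(k))


def toy_proxy_reward(prompt: str, response: str) -> float:
    """Hypothetical proxy preference model: counts distinct words."""
    return float(len(set(response.split())))


def collect_preference_pairs(
    prompts: List[str],
    policy: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    n_samples: int,
    rng: random.Random,
) -> List[Pair]:
    """Sample n responses per prompt and pair the best with the worst."""
    pairs: List[Pair] = []
    for prompt in prompts:
        candidates = [policy(prompt, rng) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda r: reward(prompt, r))
        chosen, rejected = ranked[-1], ranked[0]
        # Skip ties: a pair with equal scores carries no preference signal.
        if reward(prompt, chosen) > reward(prompt, rejected):
            pairs.append((prompt, chosen, rejected))
    return pairs


def online_iterations(
    prompts: List[str],
    policy: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    n_iters: int = 3,
    n_samples: int = 4,
    seed: int = 0,
) -> List[Pair]:
    """Run several collection rounds and accumulate the labeled pairs."""
    rng = random.Random(seed)
    dataset: List[Pair] = []
    for _ in range(n_iters):
        dataset.extend(
            collect_preference_pairs(prompts, policy, reward, n_samples, rng)
        )
        # A real workflow would update the policy on the new pairs here
        # before sampling again; that step is omitted in this sketch.
    return dataset
```

In this sketch the proxy reward is the only source of labels, so the quality of the collected pairs depends entirely on how well that model tracks real preferences.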
Code & Models
- 🤗 sfairXC/FsfairX-LLaMA3-RM-v0.1 · model · 1.8k dl · ♡ 60
- 🤗 RLHFlow/pair-preference-model-LLaMA3-8B · model · 86 dl · ♡ 38
- 🤗 Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R · model · 32 dl · ♡ 78
- 🤗 Salesforce/LLaMA-3-8B-SFR-SFT-R · model · 13 dl · ♡ 8
- 🤗 Salesforce/LLaMA-3-8B-SFR-RM-R · model · 4 dl · ♡ 11
- 🤗 qwp4w3hyb/SFR-Iterative-DPO-LLaMA-3-8B-R-iMat-GGUF · model · 17 dl · ♡ 2
- 🤗 RLHFlow/LLaMA3-iterative-DPO-final · model · 97 dl · ♡ 41
- 🤗 RLHFlow/LLaMA3-SFT · model · 78 dl · ♡ 10
- 🤗 TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-8B-R · model · 262 dl · ♡ 1
- 🤗 sirovub/SFR-Iterative-DPO-LLaMA-3-8B-R-GGUF · model · 15 dl · ♡ 1
Taxonomy
Topics: Simulation Techniques and Applications · Advanced Control Systems Optimization
Methods: Sparse Evolutionary Training
