Reward Difference Optimization For Sample Reweighting In Offline RLHF
Shiqi Wang, Zhengze Zhang, Rui Zhao, Fei Tan, Cam Tu Nguyen

TL;DR
This paper introduces Reward Difference Optimization (RDO), a novel offline RLHF method that reweighs sample pairs using reward difference coefficients, improving alignment of large language models with human preferences.
Contribution
The paper proposes RDO, a new offline RLHF approach that incorporates reward difference coefficients and a difference model to better capture preference intensities.
Findings
RDO improves alignment accuracy on benchmark datasets.
RDO improves performance under both automatic metrics and human evaluation.
Reweighting sample pairs by reward difference yields better preference modeling.
Abstract
With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences has become increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) proves effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization, abbreviated as RDO. Specifically, we introduce reward difference coefficients to reweigh sample pairs in offline RLHF. We then develop a difference model which captures rich interactions between a pair of responses for predicting these…
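The reweighting idea in the abstract can be illustrated with a minimal sketch. It assumes a generic pairwise logistic ranking loss and a non-negative per-pair coefficient standing in for the reward difference. The excerpt above does not give the paper's exact loss or describe how the difference model produces its coefficients, so the function names and the weighting form below are hypothetical, not the authors' implementation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reweighted_pairwise_loss(margins, reward_diffs):
    """Mean pairwise logistic ranking loss, weighted per pair.

    margins[i]      -- model score of the preferred response minus the
                       score of the rejected response for pair i.
    reward_diffs[i] -- hypothetical non-negative coefficient expressing
                       how strongly the preferred response is favored.
    """
    if not margins or len(margins) != len(reward_diffs):
        raise ValueError("margins and reward_diffs must be non-empty and equal length")
    total = 0.0
    for margin, weight in zip(margins, reward_diffs):
        # An unweighted ranking loss would use weight == 1 for every pair.
        total += -weight * log_sigmoid(margin)
    return total / len(margins)


if __name__ == "__main__":
    margins = [0.5, 0.5]
    print(reweighted_pairwise_loss(margins, [1.0, 1.0]))  # uniform weights
    print(reweighted_pairwise_loss(margins, [2.0, 0.1]))  # strong vs. weak preference
```

With all coefficients set to 1 this reduces to the ordinary ordinal ranking loss. Larger coefficients make strongly preferred pairs contribute more to the loss, which is the intuition behind weighting by preference intensity.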
