Towards Reward Fairness in RLHF: From a Resource Allocation Perspective

Sheng Ouyang; Yulan Hu; Ge Chen; Qingyang Li; Fuzheng Zhang; Yong Liu

arXiv:2505.23349·cs.LG·November 12, 2025

TL;DR

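This paper groups the various biases found in RLHF rewards under a single problem, reward unfairness, and models preference learning as a resource allocation problem in which rewards are the resource being distributed. It proposes two bias-agnostic methods, Fairness Regularization and Fairness Coefficient, and reports improved alignment with human preferences.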

Contribution

The paper introduces a bias-agnostic, resource-allocation perspective on reward unfairness in RLHF, together with two methods: one based on fairness regularization and one based on adjusting a fairness coefficient.

Findings

01. Improved fairness in reward models and policies.

02. Enhanced alignment with human preferences.

03. Effective mitigation of reward biases.

Abstract

Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, if these rewards are inherently imperfect, exhibiting various biases, they can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method to address the issue of reward fairness from a resource allocation perspective, without specifically designing for each type of bias, yet effectively mitigating them. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while considering the trade-off between utility and fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in…
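The abstract does not spell out the form of the fairness term, so the following is only a minimal sketch of the general idea of trading utility against fairness, not the authors' method. It assumes a Bradley-Terry preference loss as the utility term, treats each preference pair's share as `sigmoid(margin)`, and uses Jain's fairness index as the fairness measure. The function names, the choice of Jain's index, and the weight `lam` are all hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bradley_terry_loss(chosen, rejected):
    """Utility term: mean of -log sigmoid(r_chosen - r_rejected)."""
    margins = [c - r for c, r in zip(chosen, rejected)]
    return sum(softplus(-m) for m in margins) / len(margins)


def jain_index(allocations):
    """Jain's fairness index for positive allocations; 1.0 when all are equal."""
    total = sum(allocations)
    squares = sum(a * a for a in allocations)
    return (total * total) / (len(allocations) * squares)


def fair_preference_loss(chosen, rejected, lam=0.1):
    """Hypothetical objective: utility loss plus lam * (1 - fairness)."""
    # ASSUMPTION: each pair's "allocation" is sigmoid(margin), a value in (0, 1).
    # This is an illustrative choice, not a definition taken from the paper.
    shares = [sigmoid(c - r) for c, r in zip(chosen, rejected)]
    penalty = 1.0 - jain_index(shares)
    return bradley_terry_loss(chosen, rejected) + lam * penalty
```

When every pair has the same reward margin, the shares are equal, Jain's index is 1, and the objective reduces to the plain Bradley-Terry loss. Uneven margins add a positive penalty scaled by `lam`.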

Code & Models

Repositories

shoyua/towards-reward-fairness
PyTorch · Official
