TL;DR
This paper addresses reward unfairness in RLHF by modeling reward distribution as a resource allocation problem, proposing methods to improve fairness without bias-specific design, and demonstrating enhanced alignment with human preferences.
Contribution
The paper introduces a bias-agnostic, resource-allocation perspective for mitigating reward unfairness in RLHF, along with two methods built on it: Fairness Regularization and Fairness Coefficient.
Findings
Improved fairness in reward models and policies.
Enhanced alignment with human preferences.
Effective mitigation of reward biases.
Abstract
Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, if these rewards are inherently imperfect, exhibiting various biases, they can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method to address the issue of reward fairness from a resource allocation perspective, without specifically designing for each type of bias, yet effectively mitigating them. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while considering the trade-off between utility and fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in…
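To make the resource-allocation framing concrete, here is a minimal sketch of a preference loss that trades off utility against fairness. It is an illustration under stated assumptions, not the paper's formulation: the abstract is truncated before the methods are defined, so the choice of Jain's index as the fairness measure, the use of sigmoid-squashed reward margins as the "allocated resource", and the names `fair_preference_loss` and `lam` are all hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bradley_terry_loss(r_chosen, r_rejected):
    """Utility term: mean of -log sigmoid(r_chosen - r_rejected)."""
    margins = [c - r for c, r in zip(r_chosen, r_rejected)]
    return sum(softplus(-m) for m in margins) / len(margins)


def jain_index(allocations):
    """Jain's fairness index: 1.0 when all shares are equal, 1/n when one
    share takes everything. Expects non-negative, not-all-zero inputs."""
    total = sum(allocations)
    squares = sum(a * a for a in allocations)
    return (total * total) / (len(allocations) * squares)


def fair_preference_loss(r_chosen, r_rejected, lam=0.1):
    """ASSUMPTION: treat the sigmoid of each preference pair's reward margin
    as that pair's share of the resource, and penalize uneven shares.
    `lam` is a hypothetical weight on the fairness penalty."""
    shares = [sigmoid(c - r) for c, r in zip(r_chosen, r_rejected)]
    utility_loss = bradley_terry_loss(r_chosen, r_rejected)
    unfairness = 1.0 - jain_index(shares)
    return utility_loss + lam * unfairness
```

With `lam=0` this reduces to the ordinary Bradley-Terry preference loss. Raising `lam` increasingly discourages reward models that give a few pairs very large margins while leaving others near zero.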
