Group Robust Preference Optimization in Reward-free RLHF
Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic

TL;DR
This paper introduces GRPO, a method for fine-tuning large language models to perform robustly across diverse preference groups by optimizing worst-case group performance, which improves results for the worst-performing groups and reduces loss imbalance across groups.
Contribution
The paper proposes Group Robust Preference Optimization, a reward-free method that aligns LLMs to the preferences of individual labeler groups by maximizing worst-case group performance, addressing the "one-size-fits-all" single preference model assumed by traditional RLHF.
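To make "worst-case group performance" concrete, the objective can be written as a min-max problem over K labeler groups. This is a sketch under stated assumptions, not the paper's exact formulation: it assumes the reward-free base loss is the standard DPO loss computed on each group's preference data D_g, with reference policy π_ref and temperature β. The second form uses the fact that the maximum of a linear function over the probability simplex Δ_K is attained at a vertex, so the worst group can equivalently be selected by group weights α.

```latex
\mathcal{L}_g(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_g}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]

\min_{\theta} \max_{g \in \{1, \dots, K\}} \mathcal{L}_g(\theta)
  \;=\; \min_{\theta} \max_{\alpha \in \Delta_K} \sum_{g=1}^{K} \alpha_g \, \mathcal{L}_g(\theta)
```

A non-robust baseline would instead minimize a single loss pooled over all groups, which lets large or easy groups dominate the average.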
Findings
Improved worst-group performance in LLM fine-tuning.
Reduced loss imbalance across diverse preference groups.
Improved probability accuracy compared to non-robust baselines.
Abstract
Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers' groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a "one-size-fits-all" approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups' preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially…
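The abstract says GRPO adaptively and sequentially reweights groups while seeking a policy that maximizes worst-case group performance. The sketch below illustrates one common way to do this and should be read as an assumption, not as the authors' implementation: it computes a DPO-style loss per group and then takes an exponentiated-gradient (multiplicative-weights) step on the group weights, in the style of group DRO, so groups with higher loss receive more weight. The function names (`dpo_loss`, `group_losses`, `update_weights`, `robust_step`) and the hyperparameters `beta` and `eta` are hypothetical, and log-probabilities are passed in as plain floats instead of coming from a model.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical record: (group_id, logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected)
Example = Tuple[int, float, float, float, float]


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float, ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * margin), written as softplus(-beta * margin)."""
    margin = (logp_w - ref_w) - (logp_l - ref_l)
    return softplus(-beta * margin)


def group_losses(batch: Sequence[Example], num_groups: int, beta: float = 0.1) -> List[float]:
    """Average DPO loss per group; a group with no examples in the batch gets 0.0."""
    totals = [0.0] * num_groups
    counts = [0] * num_groups
    for g, logp_w, logp_l, ref_w, ref_l in batch:
        totals[g] += dpo_loss(logp_w, logp_l, ref_w, ref_l, beta)
        counts[g] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def update_weights(alpha: Sequence[float], losses: Sequence[float], eta: float = 0.5) -> List[float]:
    """Exponentiated-gradient ascent on the simplex: higher-loss groups gain weight."""
    unnorm = [a * math.exp(eta * l) for a, l in zip(alpha, losses)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def robust_step(alpha: Sequence[float], batch: Sequence[Example], beta: float = 0.1, eta: float = 0.5):
    """One sequential round: measure group losses, reweight groups, form the weighted objective."""
    losses = group_losses(batch, len(alpha), beta)
    new_alpha = update_weights(alpha, losses, eta)
    # In a real training loop, the policy would take a gradient step on this
    # weighted loss while the group weights are held fixed.
    objective = sum(a * l for a, l in zip(new_alpha, losses))
    return new_alpha, objective, losses


if __name__ == "__main__":
    batch = [
        (0, -1.0, -3.0, -2.0, -2.0),  # group 0: policy already prefers the chosen response
        (1, -3.0, -1.0, -2.0, -2.0),  # group 1: policy prefers the rejected response
    ]
    alpha, objective, losses = robust_step([0.5, 0.5], batch)
    print("losses:", losses)
    print("weights:", alpha)
    print("weighted objective:", objective)
```

In this toy batch, group 1's preferences are violated, so its loss is larger and its weight rises above 0.5 after one step. Repeating the step concentrates weight on whichever group is currently worst off, which is how the weighted objective tracks the max over groups in the min-max formulation.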
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms
Methods: ALIGN
