Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment

Jialu Wang; Heinrich Peters; Asad A. Butt; Navid Hashemi; Alireza Hashemi; Pouya M. Ghari; Joseph Hoover; James Rae; Morteza Dehghani

arXiv:2603.10009 · cs.LG · March 12, 2026

Open Access

TL;DR

This paper introduces P-GRPO, a novel reinforcement learning framework that improves alignment of large language models with diverse individual preferences by normalizing advantages against preference-specific reward histories.
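
For reference, standard GRPO computes the advantage of each of the G responses sampled for one prompt by normalizing its reward against the mean and standard deviation of that same group. A per-preference variant in the spirit of the TL;DR would instead normalize against statistics of the reward history collected for the preference p(i) under which response i was scored. The second formula is only a sketch: the symbols mu_p, sigma_p and the stabilizing epsilon are our own notation, and the authors' exact definition (for example, how the history is windowed or smoothed) may differ.

```latex
% Standard GRPO: normalize within the group of G responses to one prompt
\hat{A}_i^{\mathrm{GRPO}} = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}

% Hypothetical per-preference sketch: normalize against the reward history
% of the preference p(i) that scored response i (notation ours, not the paper's)
\hat{A}_i^{\mathrm{pref}} = \frac{r_i - \mu_{p(i)}}{\sigma_{p(i)} + \epsilon}
```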

Contribution

P-GRPO decouples advantage estimation from batch statistics, enabling better learning of heterogeneous preferences compared to standard GRPO.
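
One way to decouple advantage estimation from batch statistics is to keep running reward statistics for each preference and normalize every new reward against them. The sketch below assumes each sample carries a hashable preference identifier and uses Welford's online algorithm for the running mean and variance. The class name `PreferenceRewardNormalizer`, the `min_count` warm-up rule, and the choice to score a batch before folding it into the history are all hypothetical design choices, not details taken from the paper.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class _RunningStats:
    """Welford's online mean/variance for one preference's reward history."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Population standard deviation; 0.0 until at least two rewards are seen.
        return math.sqrt(self.m2 / self.count) if self.count > 1 else 0.0


class PreferenceRewardNormalizer:
    """Hypothetical per-preference advantage normalizer (illustrative sketch)."""

    def __init__(self, eps: float = 1e-8, min_count: int = 2):
        self.eps = eps                # guards against division by zero
        self.min_count = min_count    # warm-up: zero advantage until history exists
        self.stats = defaultdict(_RunningStats)

    def advantages(self, prefs, rewards):
        """Score each reward against its own preference's history (not the
        batch), then add the batch's rewards to those histories."""
        out = []
        for p, r in zip(prefs, rewards):
            s = self.stats[p]
            if s.count < self.min_count:
                out.append(0.0)
            else:
                out.append((r - s.mean) / (s.std + self.eps))
        for p, r in zip(prefs, rewards):
            self.stats[p].update(r)
        return out
```

For example, after a warm-up batch `norm.advantages(["a", "a", "b", "b"], [1.0, 3.0, 10.0, 30.0])`, the call `norm.advantages(["a", "b"], [3.0, 30.0])` gives both samples an advantage of about 1.0, because each reward sits one standard deviation above its own preference's history even though the two reward scales differ by a factor of ten.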

Findings

01. P-GRPO converges faster than standard GRPO.
02. P-GRPO achieves higher reward scores than standard GRPO.
03. P-GRPO better captures diverse preferences.

Abstract

Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, such as Reinforcement Learning from Human Feedback (RLHF), optimize for a single, global objective. Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, but its group-based normalization implicitly assumes that all samples are exchangeable, so it inherits this limitation in personalized settings. This assumption conflates distinct user reward distributions and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against…
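
The conflation of reward distributions can be illustrated with a toy example. The sketch below assumes two preferences whose reward models score on different scales, with the majority preference contributing more samples, and compares normalizing all samples together (treating them as exchangeable) against normalizing each preference separately. The reward values are made up for illustration, and pooling every sample into one normalization group is a simplifying assumption; this is not a reproduction of the paper's experiments.

```python
import statistics


def zscore(values, eps=1e-8):
    """Group-relative normalization: (r - mean) / (std + eps) over the given values."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


# Toy rewards (hypothetical): a majority preference scored on a high scale
# and a minority preference scored on a lower scale.
majority = [8.0, 9.0, 10.0, 9.0, 8.0, 10.0]
minority = [1.0, 3.0]  # 3.0 is the minority user's better response

# Treat all samples as exchangeable: one shared mean/std for everyone.
pooled = zscore(majority + minority)
# Normalize each preference against its own statistics.
per_pref = zscore(majority) + zscore(minority)

for label, adv in (("pooled", pooled), ("per-preference", per_pref)):
    # Print only the minority user's two advantages.
    print(label, [round(a, 2) for a in adv[-2:]])
```

In this setup the minority user's better response receives a negative advantage under pooled normalization (about -1.35), so the policy would be pushed away from it, whereas per-preference normalization gives it +1.0 and preserves the minority user's own ranking.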

Taxonomy

Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification