What is the Alignment Objective of GRPO?
Milan Vojnovic, Se-Young Yun

TL;DR
This paper analyzes how the GRPO reinforcement learning algorithm aggregates preferences. It shows that the aggregation differs from the logarithmic pooling implemented by approaches such as RLHF and that GRPO's penalty function corresponds to reverse KL divergence, with implications for AI alignment.
Contribution
It provides a formal characterization of GRPO's stationary policies and clarifies how preference aggregation arises from the reward model and penalty function.
Findings
Preference aggregation in GRPO differs from logarithmic pooling.
The penalty function corresponds to reverse KL divergence.
For groups of size two, the reward preference model reduces to pairwise comparison preferences.
Abstract
In this note, we examine the aggregation of preferences achieved by the Group Relative Policy Optimisation (GRPO) algorithm, a reinforcement learning method used to train advanced artificial intelligence models such as DeepSeek-R1-Zero and DeepSeekMath. The GRPO algorithm trains a policy using a reward preference model, which is computed by sampling a set of outputs for a given context, observing the corresponding rewards, and applying shift-and-scale normalisation to these reward values. Additionally, it incorporates a penalty function to discourage deviations from a reference policy. We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. This analysis reveals that the aggregation of preferences differs fundamentally from standard logarithmic pooling, which is implemented by other approaches such as RLHF. The precise form of preference…
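The shift-and-scale normalisation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function name `group_normalised_advantages`, the use of the population standard deviation, and the convention of returning zeros for a group with identical rewards are all assumptions made for this example.

```python
from statistics import fmean, pstdev


def group_normalised_advantages(rewards):
    """Shift-and-scale normalisation of the rewards of one sampled group.

    Each reward is shifted by the group mean and scaled by the group
    standard deviation. Assumptions of this sketch (not from the paper):
    the population standard deviation is used, and a group whose rewards
    are all equal is mapped to zeros to avoid dividing by zero.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # Four sampled outputs for one context, with binary rewards.
    print(group_normalised_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group of two: only the ordering of the two rewards matters.
    print(group_normalised_advantages([0.3, 0.9]))
```

Under these assumptions, a group of two always normalises to roughly -1 for the lower-reward output and +1 for the higher-reward one, whatever the size of the gap. Only the outcome of the comparison survives, which is one way to see why groups of size two connect to pairwise comparisons.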
Taxonomy
Topics: Recommender Systems and Techniques · Mobile Crowdsensing and Crowdsourcing · Machine Learning and Data Classification
Methods: Sparse Evolutionary Training
