What is the Alignment Objective of GRPO?

Milan Vojnovic; Se-Young Yun

arXiv:2502.18548·cs.LG·March 14, 2025

TL;DR

This paper analyzes how the GRPO reinforcement learning algorithm aggregates preferences. It shows that the aggregation differs from the logarithmic pooling implemented by approaches such as RLHF and that GRPO's penalty function corresponds to the reverse KL divergence, with implications for AI alignment.

Contribution

It provides a formal characterization of GRPO's stationary policies and clarifies how preference aggregation arises from the reward model and penalty function.

Findings

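01. Preference aggregation in GRPO differs fundamentally from the standard logarithmic pooling implemented by approaches such as RLHF.

02. The penalty function corresponds to the reverse KL divergence.

03. For groups of size two, the reward preference model corresponds to pairwise comparison preferences.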

Abstract

In this note, we examine the aggregation of preferences achieved by the Group Relative Policy Optimisation (GRPO) algorithm, a reinforcement learning method used to train advanced artificial intelligence models such as DeepSeek-R1-Zero and DeepSeekMath. The GRPO algorithm trains a policy using a reward preference model, which is computed by sampling a set of outputs for a given context, observing the corresponding rewards, and applying shift-and-scale normalisation to these reward values. Additionally, it incorporates a penalty function to discourage deviations from a reference policy. We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. This analysis reveals that the aggregation of preferences differs fundamentally from standard logarithmic pooling, which is implemented by other approaches such as RLHF. The precise form of preference…
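
The shift-and-scale normalisation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `group_normalised_advantages`, the `eps` stabiliser, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
import math


def group_normalised_advantages(rewards, eps=1e-8):
    """Shift-and-scale normalise the rewards of one sampled group.

    `rewards` holds one scalar reward per output sampled for the same context.
    Hypothetical helper: the name, `eps`, and the population standard
    deviation are assumptions, not details taken from the paper.
    """
    n = len(rewards)
    mean = sum(rewards) / n  # shift: subtract the group mean
    # scale: divide by the group's (population) standard deviation
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled for one context.
advantages = group_normalised_advantages([1.0, 0.0, 0.0, 1.0])

# A group of size two: only the ordering of the two rewards matters.
pair = group_normalised_advantages([0.9, 0.2])
```

Under these assumptions, a group of size two yields values close to +1 for the higher-reward output and close to -1 for the other, whatever the gap between the two rewards. This is one way to see why the group-of-two case behaves like a pairwise comparison.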


Taxonomy

Topics: Recommender Systems and Techniques · Mobile Crowdsensing and Crowdsourcing · Machine Learning and Data Classification

Methods: Sparse Evolutionary Training