Geometric-Mean Policy Optimization

Yuzhong Zhao; Yue Liu; Junpeng Liu; Jingye Chen; Xun Wu; Yaru Hao; Tengchao Lv; Shaohan Huang; Lei Cui; Qixiang Ye; Fang Wan; Furu Wei

arXiv:2507.20673·cs.CL·October 21, 2025

TL;DR

This paper introduces Geometric-Mean Policy Optimization (GMPO), a method that stabilizes policy updates when training large language models for reasoning, and improves their performance, by replacing GRPO's arithmetic mean of token-level rewards with the geometric mean, which is less sensitive to outliers.

Contribution

GMPO is a simple yet effective modification of GRPO that improves stability and performance by using the geometric mean of token rewards, supported by theoretical analysis and empirical results.
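As a rough illustration of the difference between the two aggregators, the block below writes both in a simplified per-sequence form. The notation is an assumption made for this sketch and is not taken from the paper: $\rho_t > 0$ is the importance sampling ratio of token $t$, $\hat{A}$ is the sequence-level advantage, and $T$ is the sequence length. Clipping and the average over the group are omitted, so this is not the full GRPO or GMPO objective.

```latex
% Simplified, unclipped per-sequence objectives (notation assumed for illustration).
\[
\mathcal{J}_{\text{arith}} = \frac{1}{T}\sum_{t=1}^{T} \rho_t \,\hat{A},
\qquad
\mathcal{J}_{\text{geo}} = \Bigl(\prod_{t=1}^{T} \rho_t\Bigr)^{1/T} \hat{A}
= \exp\!\Bigl(\frac{1}{T}\sum_{t=1}^{T} \log \rho_t\Bigr)\,\hat{A}.
\]
% AM-GM inequality for positive ratios: the geometric mean never exceeds
% the arithmetic mean, so one very large ratio inflates it far less.
\[
\Bigl(\prod_{t=1}^{T} \rho_t\Bigr)^{1/T} \;\le\; \frac{1}{T}\sum_{t=1}^{T} \rho_t .
\]
```

The exponential form on the right is the log-space version: it averages log-ratios rather than multiplying raw ratios, which avoids overflow and underflow for long sequences.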

Findings

01

GMPO improves Pass@1 by up to 4.1% over GRPO.

02

GMPO outperforms several state-of-the-art methods on reasoning benchmarks.

03

GMPO demonstrates more stable policy updates due to reduced outlier influence.

Abstract

Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models by optimizing the arithmetic mean of token-level rewards. Unfortunately, GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training. In this study, we propose Geometric-Mean Policy Optimization (GMPO), which aims to improve the stability of GRPO by suppressing token reward outliers. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. GMPO is plug-and-play: simply replacing GRPO's arithmetic mean with the geometric mean of token-level rewards, as the latter is inherently less…
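The outlier sensitivity described in the abstract can be seen in a small numerical sketch. The code below is not the paper's implementation: the function names, the toy ratios, and the omission of clipping and group averaging are all assumptions made for illustration. It compares the arithmetic and geometric means of per-token importance sampling ratios when a single token has an extreme ratio, computing the geometric mean in log-space.

```python
import math


def arithmetic_mean_objective(log_ratios, advantage):
    """Arithmetic mean of token-level importance ratios, scaled by the advantage.

    Hypothetical helper: a simplified, unclipped stand-in for a GRPO-style aggregator.
    """
    ratios = [math.exp(lr) for lr in log_ratios]
    return advantage * sum(ratios) / len(ratios)


def geometric_mean_objective(log_ratios, advantage):
    """Geometric mean of token-level importance ratios, scaled by the advantage.

    Hypothetical helper: computed in log-space as exp(mean(log ratio)), which
    avoids multiplying many raw ratios together.
    """
    mean_log_ratio = sum(log_ratios) / len(log_ratios)
    return advantage * math.exp(mean_log_ratio)


# Toy sequence (assumed values): three tokens with ratio 1 and one outlier with ratio 10.
log_ratios = [0.0, 0.0, 0.0, math.log(10.0)]
advantage = 1.0

am = arithmetic_mean_objective(log_ratios, advantage)
gm = geometric_mean_objective(log_ratios, advantage)
print(f"arithmetic mean: {am:.3f}")  # (1 + 1 + 1 + 10) / 4 = 3.250
print(f"geometric mean:  {gm:.3f}")  # 10 ** 0.25, about 1.778
```

In this toy case the single outlier more than triples the arithmetic-mean value relative to a sequence of all-ones ratios, while the geometric mean rises by less than a factor of two.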

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

- The problem is clearly defined and motivated. The authors narrow their focus to a specific, plausible mechanism: the sensitivity of the arithmetic mean aggregator in the GRPO objective to these outlier IS ratios.
- The method is practical and easy to implement. The loss is a small rewrite of GRPO. The authors enhance this practicality by providing clear pseudo-code in Algorithm 1, which details the implementation in log-space for numerical stability. This transparency is crucial for reprodu

Weaknesses

- The core novelty of the paper lies in the application of the geometric mean to the GRPO objective. While this application is new in this specific context, the underlying idea of using a robust statistical estimator to handle outliers is a foundational concept in statistics and data analysis. This weakness is compounded by a failure to justify the choice of the geometric mean over other standard robust estimators that are designed to be less sensitive to outliers and could plausibly offer similar o

Reviewer 02·Rating 6·Confidence 4

Strengths

The method is easy to follow, and the paper writing is clear.

Weaknesses

* Judging from Table 4, the proposed method's gain seems quite marginal. The token-wise clip also does not seem very effective.
* The major technical change from GRPO seems to be token-level clipping + geometric mean vs. group arithmetic mean. The technical novelty is low.
* Although a discussion is present, a comparison is missing against methods that control training stability by carefully setting the clipping ratio (e.g. DAPO).

Reviewer 03·Rating 8·Confidence 5

Strengths

1. The core idea of replacing the arithmetic mean with the geometric mean is simple, elegant, and well-motivated. It directly targets a plausible source of instability in GRPO (sensitivity to outlier rewards) with a classic statistical tool known for its robustness. The "plug-and-play" nature of the modification makes it highly practical.
2. The empirical evaluation is comprehensive and convincing. The authors demonstrate consistent improvements over GRPO across multiple model sizes (1.5B, 7B,

Weaknesses

1. The manuscript has numerous small but distracting typographical and formatting errors.
   * The text is missing a character at the beginning of lines 190 and 202.
   * The formulas in the derivation on lines 182-189 lack equation numbers.
   * Several plots in Figure 4 are missing x-axis labels (e.g., plots c, e, g), which should be labeled "Training steps" for clarity.
2. The related work section, while comprehensive, reads like a long list of recent methods without sufficient structure