COPO: Consistency-Aware Policy Optimization

Jinghang Han; Jiawei Chen; Hang Shao; Hao Ma; Mingcheng Li; Xintian Shen; Lihao Zheng; Wei Chen; Tao Wei; Lihua Zhang

arXiv:2508.04138·cs.LG·August 7, 2025

TL;DR

This paper introduces COPO, a reinforcement learning framework that enhances large language models' reasoning by using consistency-aware rewards and adaptive optimization, leading to improved performance on reasoning benchmarks.

Contribution

The paper proposes a novel consistency-aware policy optimization method that improves training signals and balances exploration and convergence in LLM reasoning tasks.

Findings

01·Significant performance improvements on mathematical reasoning benchmarks.

02·Robustness demonstrated across multiple tasks.

03·Effective reward design and optimization strategies.

Abstract

Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on…
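
The degeneration described in the abstract can be illustrated with a minimal sketch. It assumes a GRPO-style advantage that standardizes each reward against the mean and standard deviation of its own group. The function name `group_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses.

    Hypothetical helper: eps and the population std are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct responses get roughly +1, incorrect roughly -1.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Identical outcomes: every advantage is zero, so the group yields no gradient.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed, all_correct, all_wrong)
```

Under these assumptions, a group contributes a learning signal only when its responses disagree, which is the inefficiency the abstract points to.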

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 3

Strengths

Motivation matches a real pain point. GRPO’s zero-variance groups are common and wasteful; adding a batch-normalized prompt-level signal is a focused fix.

Drop-in practicality. The method is easy to implement atop GRPO: compute prompt returns, standardize across the mini-batch, and mix with a simple entropy gate.

Some positive results. On a stronger backbone (7B), COPO shows consistent improvements on math reasoning benchmarks with reasonably stable training curves.

Useful ablations. Global-o

Weaknesses

Incremental novelty. Batch-level baselines/standardization are classic variance-reduction ideas in policy gradients; here they’re applied at the prompt level and gated. Useful in practice, but not conceptually deep.

Benchmark choice feels dated. MATH-500 and GSM8K are saturated in 2025. The paper needs stronger generalization tests (recent AIME splits, IMO-style sets, and especially Putnam-AXIOM functional variations) to make the case.

Statistics are thin. Headline tables/curves lack confidenc

Reviewer 02·Rating 2·Confidence 4

Strengths

The paper identifies a GRPO failure mode: when all *G* rollouts for a prompt agree (all right or all wrong), the standardized advantages are near zero and the samples are wasted. Fig. 1 shows this happens often in the 3B setup.

Weaknesses

1. The global advantage is the prompt-level mean reward, standardized by the minibatch mean/std (Eq. 6). This yields a constant baseline for all trajectories of a prompt, which is a standard variance-reduction control variate; calling it a "global optimization mechanism" overstates its novelty. Eq. 7 applies this same advantage to all tokens and trajectories, which discards the intra-prompt credit assignment that GRPO was designed to keep.
2. The paper claims COPO fixes gradie
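
Point 1 above can be sketched in a few lines. The sketch follows only the reviewer's wording (prompt-level mean reward, standardized by the minibatch mean/std); the paper's Eq. 6 is not reproduced on this page, so the function name `global_advantages`, the `eps` term, and the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def global_advantages(batch_rewards, eps=1e-6):
    """Return one advantage per prompt, shared by all of its trajectories.

    batch_rewards: list of per-prompt reward lists (hypothetical layout).
    """
    prompt_means = [mean(group) for group in batch_rewards]
    mu = mean(prompt_means)
    sigma = pstdev(prompt_means)
    return [(m - mu) / (sigma + eps) for m in prompt_means]


batch = [
    [1.0, 1.0, 1.0, 1.0],  # all correct: the group-level advantage would be zero
    [0.0, 0.0, 0.0, 0.0],  # all wrong: the group-level advantage would be zero
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes
]
adv = global_advantages(batch)
print(adv)
```

In this sketch the all-correct and all-wrong prompts receive nonzero signals of opposite sign, while every trajectory within a prompt receives the same value, which is the loss of intra-prompt credit assignment the reviewer objects to.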

Reviewer 03·Rating 4·Confidence 4

Strengths

- The paper does a good job of identifying, explaining, and quantifying the "advantage degeneration" problem in GRPO, showing it affects ~60% of samples for the Qwen2.5-3B-instruct model.
- The core idea of introducing a batch-level "global" advantage to provide a signal when the "local" advantage vanishes is an elegant and effective solution. It correctly identifies the wastefulness of simply discarding this data.
- The use of consistency entropy to soft-blend the local and global losses is an

Weaknesses

- The primary weakness of this paper is its limited novelty. The proposed method is a highly incremental design on top of GRPO, essentially adding a "global" reward layer to the existing "local" one. This combination of high complexity for a small incremental improvement feels more like an engineering design than a fundamental contribution.
- The method does not fundamentally solve the "zero advantage" problem. The paper claims to address "advantage degeneration", but it merely sidesteps the iss
