Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

Jiayi Wu; Ruobing Xie; Zeqian Huang; Lei Jiang; Can Xu; Kangyang Luo; Ming Gao; Xiang Li

arXiv:2604.18235 · cs.CL · April 21, 2026

TL;DR

This paper introduces CalibAdv, an advantage calibration method for deep search agents trained with GRPO. It aims to improve training stability and performance by addressing coarse-grained advantage assignment and the imbalance between positive and negative advantages.

Contribution

CalibAdv is an advantage calibration technique that improves the stability and effectiveness of deep search training through fine-grained advantage adjustment.

Findings

01. CalibAdv improves model performance across multiple benchmarks.
02. CalibAdv stabilizes training, reducing collapse incidents.
03. CalibAdv effectively rebalances advantages for better intermediate step correctness.

Abstract

Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method…
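
To make the first challenge concrete, the following is a minimal sketch of the standard group-relative advantage used in GRPO. It is not CalibAdv itself, whose details are not given in the abstract. The sketch assumes outcome-only rewards (1.0 for a correct final answer, 0.0 otherwise) and population standard deviation for normalization; the helper names `grpo_advantages` and `broadcast_to_steps`, the rewards, and the step counts are hypothetical toy values chosen for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward within its sampled group.

    Assumption: population std with a small eps for numerical safety.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_steps(advantages, steps_per_rollout):
    """Copy each rollout's single advantage onto every one of its steps."""
    return [[a] * n for a, n in zip(advantages, steps_per_rollout)]


# Toy group of four rollouts for one question: only the first answers correctly.
rewards = [1.0, 0.0, 0.0, 0.0]
# Hypothetical number of search/reasoning steps in each rollout.
steps = [3, 4, 2, 5]

adv = grpo_advantages(rewards)
per_step = broadcast_to_steps(adv, steps)

for i, (a, n) in enumerate(zip(adv, steps)):
    print(f"rollout {i}: advantage {a:+.3f} applied to all {n} steps")
```

In this toy group, every step of the three failed rollouts receives the same negative advantage, whether or not that individual search step was correct. This is the coarse-grained assignment the abstract describes.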

Code & Models

Repositories

wujwyi/CalibAdv (GitHub)
