TL;DR
This paper introduces CalibAdv, a method that calibrates advantage signals when training deep search agents with GRPO. It improves training stability and performance by correcting coarse-grained and imbalanced advantage assignment.
Contribution
CalibAdv is an advantage calibration technique that adjusts advantages at a fine granularity, making deep search training more stable and more effective.
Findings
CalibAdv improves model performance across multiple benchmarks.
CalibAdv stabilizes training, reducing collapse incidents.
CalibAdv rebalances advantages so that they better reflect the correctness of intermediate steps.
Abstract
Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method…
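To make the "coarse-grained advantage assignment" problem concrete, here is a minimal sketch of the standard group-relative advantage used by GRPO. It is not CalibAdv itself, whose details are not given in this excerpt. The binary rewards, the step counts, the use of the population standard deviation, and the helper names `grpo_advantages` and `broadcast_to_steps` are all illustrative assumptions.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: normalize each rollout's reward
    by the mean and standard deviation of its sampling group.
    Assumption: population std; implementations differ on this."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_steps(advantage, num_steps):
    """Every step of a rollout receives the same advantage value.
    This is the coarse-grained assignment the abstract refers to."""
    return [advantage] * num_steps

# Hypothetical group of 4 rollouts for one question:
# only the first reaches the correct final answer (reward 1.0).
rewards = [1.0, 0.0, 0.0, 0.0]
steps_per_rollout = [3, 4, 2, 5]  # number of search/reasoning steps (made up)

advs = grpo_advantages(rewards)
per_step = [broadcast_to_steps(a, n) for a, n in zip(advs, steps_per_rollout)]

for i, (a, steps) in enumerate(zip(advs, per_step)):
    print(f"rollout {i}: advantage={a:+.3f}, applied to {len(steps)} steps")
```

In this sketch, every step of a failed rollout inherits the same negative advantage, including any intermediate search step that was actually correct. The group also contains more negatively weighted steps than positively weighted ones, which illustrates the kind of positive/negative imbalance the abstract mentions.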
