On Group Relative Policy Optimization Collapse in Agent Search: The Lazy Likelihood-Displacement
Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li

TL;DR
This paper identifies a core failure mode called Lazy Likelihood Displacement in Group Relative Policy Optimization for tool-integrated RL, and proposes a regularization method to stabilize training and improve performance.
Contribution
It uncovers Lazy Likelihood Displacement as a key cause of collapse in GRPO and introduces a likelihood-preserving regularization to mitigate this issue.
Findings
LLD causes early stagnation and collapse in training.
The proposed LLDS regularization stabilizes training and prevents gradient explosion.
LLDS yields significant performance improvements on multiple benchmarks.
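One standard way to see why low-confidence responses can go hand in hand with larger gradients is the softmax identity: the gradient of log p[token] with respect to the logits equals onehot(token) - p. The sketch below is only an illustration of that identity on a toy three-token vocabulary. It is not the paper's analysis, and the helper names `softmax` and `logit_grad_norm` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_grad_norm(logits, token):
    """L2 norm of d log p[token] / d logits, which equals onehot(token) - p."""
    p = softmax(logits)
    grad = [(1.0 if i == token else 0.0) - p_i for i, p_i in enumerate(p)]
    return math.sqrt(sum(g * g for g in grad))


# Toy vocabulary of three tokens (illustrative values, not from the paper).
confident = logit_grad_norm([5.0, 0.0, 0.0], token=0)  # sampled token is likely
uncertain = logit_grad_norm([0.0, 0.0, 0.0], token=0)  # sampled token has p = 1/3

print(f"confident: {confident:.4f}, uncertain: {uncertain:.4f}")
```

The norm is larger when the sampled token has low probability. This covers only the per-token gradient with respect to the logits, so the full parameter gradient in a real model depends on more than this term.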
Abstract
Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet it consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, in which declining likelihood leads to low-confidence responses, which inflate gradients and ultimately cause collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question…
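As a rough illustration of the two quantities the abstract refers to, the sketch below computes group-relative advantages in the usual GRPO style (rewards standardized within a group of responses to the same prompt) and a simple diagnostic for likelihood displacement (the change in each response's log-likelihood across one policy update). The function names, the use of the population standard deviation, the `tol` threshold, and all numbers are assumptions made for illustration; they are not taken from the paper or from Search-R1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def likelihood_displacement(logp_before, logp_after):
    """Per-response change in sequence log-likelihood across one update."""
    return [after - before for before, after in zip(logp_before, logp_after)]


def lazy_fraction(displacements, tol=0.0):
    """Fraction of responses whose log-likelihood rose by no more than tol."""
    return sum(d <= tol for d in displacements) / len(displacements)


# Hypothetical group of four responses: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical sequence log-likelihoods before and after one policy update.
logp_before = [-12.0, -15.0, -9.0, -20.0]
logp_after = [-12.5, -15.0, -8.0, -21.0]
displacements = likelihood_displacement(logp_before, logp_after)

print(advantages)
print(displacements, lazy_fraction(displacements))
```

In this toy group, three of the four responses fail to gain likelihood, including one correct response. A reduction or stagnation of this kind across both correct and incorrect responses is what the abstract describes as LLD.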
Code & Models
- 🤗 SEGAgentRL/LLDS-A-GRPO-Qwen2.5-7B-Ins · model · 8 dl · ♡ 2
- 🤗 SEGAgentRL/LLDS-A-GRPO-Qwen2.5-7B-Base · model · 6 dl · ♡ 2
- 🤗 SEGAgentRL/LLDS-A-GSPO-Qwen2.5-3B-Ins · model · 49 dl · ♡ 1
- 🤗 SEGAgentRL/LLDS-R-GSPO-Qwen2.5-3B-Ins · model · 4 dl · ♡ 1
- 🤗 SEGAgentRL/LLDS-R-GRPO-Qwen2.5-3B-Base · model · 5 dl · ♡ 1
- 🤗 SEGAgentRL/LLDS-A-GRPO-Qwen2.5-3B-Base-MA · model · 4 dl · ♡ 1
- 🤗 SEGAgentRL/LLDS-A-GRPO-Qwen2.5-3B-Base · model · 4 dl
- 🤗 SEGAgentRL/LLDS-R-GRPO-Qwen2.5-3B-Ins · model · 4 dl · ♡ 1
- 🤗 SEGAgentRL/LLDS-A-GRPO-Qwen2.5-3B-Ins · model · 4 dl
- 🤗 SEGAgentRL/LLDS-A-GRPO-Llama3.2-3B-Base-MA · model
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
