Improving Search Agent with One Line of Code
Jian Li, Dongsheng Chen, Zhenhua Xu, Yizhang Jin, Jiafu Wu, Chengjie Wang, Xiaotong Yuan, Yabiao Wang

TL;DR
This paper introduces SAPO, a one-line modification that improves the stability and performance of search agents trained with Tool-based Agentic Reinforcement Learning (TARL) by preventing the training collapse caused by Importance Sampling Distribution Drift (ISDD).
Contribution
SAPO is a novel, easy-to-implement method that stabilizes TARL training with a conditional KL penalty, significantly enhancing search agent performance.
Findings
SAPO achieves a +10.6% absolute improvement on QA benchmarks.
SAPO maintains training stability across different model sizes and families.
SAPO is deployable with only a one-line code change.
Abstract
Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to autonomously interact with external tools over a multi-turn information-seeking process. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old…
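To make the mechanism described in the abstract concrete, here is a minimal Python sketch; it is not the authors' implementation. The abstract is cut off before it states when the KL penalty applies, so the gate used here (penalize only tokens whose importance sampling ratio has dropped below `threshold`) is an assumption. The function names, the `beta` weight, and the use of log(pi_old / pi_new) on sampled tokens as a per-token KL estimate are likewise illustrative choices.

```python
import math


def token_ratios(logp_new, logp_old):
    """Per-token importance sampling ratios pi_new / pi_old."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO-style clipped objective for a single token (to be maximized)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def conditional_kl_penalty(logp_new, logp_old, threshold=0.8):
    """Assumed gate: penalize only tokens whose ratio fell below `threshold`.

    The penalty is log(pi_old / pi_new), a simple sample-based KL estimate.
    Tokens that have not drifted contribute zero.
    """
    return [
        (o - n) if math.exp(n - o) < threshold else 0.0
        for n, o in zip(logp_new, logp_old)
    ]


def sketch_loss(logp_new, logp_old, advantages, eps=0.2, beta=0.1, threshold=0.8):
    """Negative clipped surrogate plus the gated KL term, averaged over tokens."""
    ratios = token_ratios(logp_new, logp_old)
    surrogate = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    penalty = conditional_kl_penalty(logp_new, logp_old, threshold)
    n = len(ratios)
    return -sum(surrogate) / n + beta * sum(penalty) / n


if __name__ == "__main__":
    logp_old = [-1.0, -1.0]
    logp_new = [-1.0, -3.0]  # the second token's probability has collapsed
    print(token_ratios(logp_new, logp_old))
    print(conditional_kl_penalty(logp_new, logp_old))
    print(sketch_loss(logp_new, logp_old, advantages=[1.0, 1.0]))
```

In this toy input the second token's ratio is exp(-2), so its unclipped surrogate term, and hence its gradient contribution, is scaled down by that small ratio. Under the assumed gate, the KL term is the only part of the loss that still pushes such a token back toward the old policy.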
Code & Models
- 🤗 swordli/Qwen2.5-3B-Base-SAPO (model, 71 downloads)
- 🤗 swordli/Qwen2.5-3B-Instruct-SAPO (model, 33 downloads)
- 🤗 swordli/Qwen2.5-1.5B-Instruct-SAPO (model, 30 downloads)
- 🤗 swordli/Qwen2.5-7B-Instruct-SAPO (model, 26 downloads)
- 🤗 swordli/Qwen2.5-14B-Instruct-SAPO (model, 31 downloads)
- 🤗 swordli/Llama-3.2-3B-Instruct-SAPO (model, 33 downloads)
- 🤗 swordli/Llama-3.2-3B-Base-SAPO (model, 32 downloads)
Taxonomy
Topics: Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing · Multimodal Machine Learning Applications
