Improving Search Agent with One Line of Code

Jian Li; Dongsheng Chen; Zhenhua Xu; Yizhang Jin; Jiafu Wu; Chengjie Wang; Xiaotong Yuan; Yabiao Wang

arXiv:2603.10069 · cs.LG · March 12, 2026

TL;DR

This paper introduces SAPO (Search Agent Policy Optimization), a one-line modification that stabilizes search agents trained with Tool-based Agentic Reinforcement Learning (TARL) and improves their performance by preventing the training collapse caused by Importance Sampling Distribution Drift (ISDD).

Contribution

SAPO is a novel, easy-to-implement method that stabilizes TARL training with a conditional KL penalty, significantly enhancing search agent performance.

Findings

01. SAPO achieves +10.6% absolute improvement on QA benchmarks.

02. SAPO maintains training stability across different model sizes and families.

03. SAPO is deployable with only a one-line code change.

Abstract

Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to interact autonomously with external tools over a multi-turn information-seeking process. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old…
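To make the mechanism in the abstract concrete, the sketch below shows, for a single token, how the standard PPO/GRPO clipped surrogate loses its gradient once the importance sampling ratio collapses, and how a KL term applied only to some tokens can restore a signal. It is an illustration under stated assumptions, not the paper's implementation: the abstract is cut off before it states SAPO's actual condition, so the rule `ratio < low`, the parameters `beta` and `low`, the choice of the `(r - 1) - log r` KL estimator, and all function names are hypothetical.

```python
import math


def clipped_surrogate_grad(ratio, advantage, eps=0.2):
    """Gradient of min(r*A, clip(r, 1-eps, 1+eps)*A) w.r.t. log pi_new (one token)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    if ratio * advantage <= clipped * advantage:
        # Unclipped branch is active: d(r*A)/d(log pi_new) = r*A.
        return ratio * advantage
    # Clipped branch is active: the objective is constant in r, so no gradient.
    return 0.0


def kl_estimate_grad(ratio):
    """Gradient of the estimator (r - 1) - log r w.r.t. log pi_new, which is r - 1."""
    return ratio - 1.0


def token_grad(logp_new, logp_old, advantage, eps=0.2, beta=0.1, low=0.8):
    """Ascent direction for one token: clipped surrogate minus a conditional KL term.

    ASSUMPTION: the penalty is switched on when ratio < low. This condition is a
    placeholder chosen for illustration; it is not taken from the paper.
    """
    ratio = math.exp(logp_new - logp_old)
    grad = clipped_surrogate_grad(ratio, advantage, eps)
    if ratio < low:  # hypothetical condition
        grad -= beta * kl_estimate_grad(ratio)
    return grad
```

With a collapsed ratio of 0.01, `clipped_surrogate_grad` returns exactly zero for a negative advantage and a value scaled down by the ratio for a positive one, which matches the "nullified gradient updates" the abstract describes. The penalty gradient `r - 1` stays close to -1 in that regime, so the combined update still pushes the ratio back toward 1.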

Taxonomy

Topics: Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing · Multimodal Machine Learning Applications