Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

Yubao Zhao; Weiquan Huang; Sudong Wang; Ruochen Zhao; Chen Chen; Yao Shu; Chengwei Qin

arXiv:2602.03719·cs.CL·February 4, 2026

TL;DR

This paper introduces BranPO, a contrastive, value-free training method for multi-turn search agents that improves long-horizon task performance by reducing credit ambiguity and enhancing training stability without extra training cost.

Contribution

The paper proposes BranPO, a novel contrastive learning approach with adaptive sampling and masking, specifically designed to improve long-horizon agent training efficiency and accuracy.

Findings

01. BranPO outperforms strong baselines on question answering benchmarks.

02. It achieves significant accuracy improvements on long-horizon tasks.

03. The method maintains training efficiency without increasing the overall training budget.

Abstract

Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency…
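
The truncate-and-resample step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `branch_near_tail`, the parameters `tail_steps` and `num_branches`, the callables `rollout_fn` and `reward_fn`, and the group-normalized advantage (mean-centered and scaled by the standard deviation) are all assumptions made for illustration. The sketch only shows how sibling suffixes that share one prefix can be scored against each other using outcome rewards alone, with no learned value function.

```python
from statistics import mean, pstdev


def branch_near_tail(trajectory, rollout_fn, reward_fn, num_branches=4, tail_steps=2):
    """Truncate a trajectory near its tail and resample alternative suffixes.

    All names here are hypothetical, chosen for illustration only.
    trajectory : list of steps from one completed rollout
    rollout_fn : callable(prefix) -> list of steps; samples a new continuation
    reward_fn  : callable(full_trajectory) -> float; outcome reward
    Returns the shared prefix and a list of (suffix, advantage) pairs.
    """
    # Cut point: keep everything except the last `tail_steps` steps.
    cut = max(0, len(trajectory) - tail_steps)
    prefix = trajectory[:cut]

    # The original tail is one branch; the rest are fresh continuations.
    suffixes = [trajectory[cut:]]
    suffixes += [rollout_fn(prefix) for _ in range(num_branches - 1)]

    # Score each completed trajectory with the outcome reward only.
    rewards = [reward_fn(prefix + suffix) for suffix in suffixes]

    # Assumed group-relative advantage: compare siblings that share the prefix.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # Every branch got the same reward, so there is no contrastive signal.
        advantages = [0.0] * len(rewards)
    else:
        advantages = [(r - mu) / sigma for r in rewards]

    return prefix, list(zip(suffixes, advantages))
```

Because every branch shares the same prefix, any difference in reward can only come from the resampled suffix, which is one way to read the abstract's claim about reduced credit ambiguity. In this sketch the advantages would be applied to suffix steps only; whether BranPO treats the prefix the same way is not stated in the text above.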

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics