Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling
Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin

TL;DR
This paper introduces BranPO, a contrastive, value-free training method for multi-turn search agents that improves long-horizon task performance by reducing credit ambiguity and enhancing training stability without extra training cost.
Contribution
The paper proposes BranPO, a novel contrastive learning approach with adaptive sampling and masking, specifically designed to improve long-horizon agent training efficiency and accuracy.
Findings
BranPO outperforms strong baselines on question answering benchmarks.
It achieves significant accuracy improvements on long-horizon tasks.
The method maintains training efficiency without increasing overall training budget.
Abstract
Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency…
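The branching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `branch_and_score`, `rollout_fn`, `reward_fn`, `tail_steps`, and `num_branches` are hypothetical, and the group-normalized advantage is an assumption borrowed from GRPO-style methods, since the abstract does not specify how the contrastive signal is computed.

```python
from statistics import mean, pstdev


def branch_and_score(trajectory, rollout_fn, reward_fn, tail_steps=2, num_branches=4):
    """Truncate a trajectory near its tail, resample continuations, and
    score every suffix relative to the others sharing the same prefix.

    trajectory  : list of steps from one completed rollout
    rollout_fn  : hypothetical callable, prefix -> list of new suffix steps
    reward_fn   : hypothetical callable, full trajectory -> scalar outcome reward
    """
    # Cut point near the tail; everything before it is the shared prefix.
    cut = max(0, len(trajectory) - tail_steps)
    prefix = trajectory[:cut]

    # Keep the original suffix and resample alternatives from the same prefix.
    suffixes = [trajectory[cut:]]
    for _ in range(num_branches - 1):
        suffixes.append(rollout_fn(prefix))

    # Outcome reward only; no value model is involved (value-free).
    rewards = [reward_fn(prefix + suffix) for suffix in suffixes]

    # Assumed contrastive signal: normalize rewards within the branch group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All branches agree, so there is no contrast to learn from.
        advantages = [0.0] * len(rewards)
    else:
        advantages = [(r - mu) / sigma for r in rewards]

    return prefix, suffixes, advantages
```

Because every suffix shares the same prefix, any difference in reward can only come from the resampled tail steps, which is one way to read the abstract's claim about reduced credit ambiguity.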
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
