Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

Vivek Kalyan; Martin Andrews

arXiv:2510.24126·cs.CL·October 29, 2025

TL;DR

This paper shows that reinforcement learning improves large language model agents on complex, multi-turn search tasks. On a legal document search benchmark, an RL-trained 14-billion-parameter model reaches 85% accuracy versus 78% for frontier-class models, and agents do better when allowed longer multi-turn horizons.

Contribution

The paper applies RL training to large language model agents for multi-turn search and reports higher accuracy than prompt-based approaches on a legal document search benchmark.

Findings

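01 An RL-trained 14B-parameter model achieves 85% accuracy on a legal document search benchmark

02 Agents achieve better results when allowed longer multi-turn horizons, both during training and at test time

03 The RL-trained model outperforms frontier-class models in accuracy (85% vs 78%)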

Abstract

Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14 Billion parameter model outperforms frontier class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test-time, that show these agents achieve better results if allowed to operate over longer multi-turn horizons.
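To make the turn-restricted setting concrete, here is a minimal Python sketch of a multi-turn search loop with a cap on the number of turns. It is an illustration under stated assumptions, not the authors' implementation: the toy corpus, the `search` tool, the `Action` type, `run_episode`, and `scripted_policy` are all hypothetical names, and the scripted policy stands in for what would be an LLM (prompted or RL-trained) in practice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical toy corpus standing in for a legal document collection.
CORPUS: Dict[str, str] = {
    "doc-1": "The lease term is five years, beginning in 2020.",
    "doc-2": "The governing law of the agreement is Singapore law.",
}


@dataclass
class Action:
    kind: str  # "search", "read", or "answer"
    arg: str   # query, document id, or final answer text


def search(query: str) -> List[str]:
    """Return ids of documents containing every query word (naive keyword match)."""
    words = query.lower().split()
    return [doc_id for doc_id, text in CORPUS.items()
            if all(word in text.lower() for word in words)]


# A policy maps (question, tool-output history) to the next action.
Policy = Callable[[str, List[str]], Action]


def run_episode(policy: Policy, question: str, max_turns: int) -> Optional[str]:
    """Run one episode; return the answer, or None if the turn budget runs out."""
    history: List[str] = []
    for _ in range(max_turns):
        action = policy(question, history)
        if action.kind == "answer":
            return action.arg
        if action.kind == "search":
            history.append("search:" + ",".join(search(action.arg)))
        elif action.kind == "read":
            history.append("read:" + CORPUS.get(action.arg, ""))
    return None


def scripted_policy(question: str, history: List[str]) -> Action:
    """Stand-in for an LLM: search first, read the top hit, then answer."""
    if not history:
        return Action("search", "governing law")
    if history[-1].startswith("search:"):
        first_hit = history[-1][len("search:"):].split(",")[0]
        return Action("read", first_hit)
    text = history[-1][len("read:"):]
    return Action("answer", "Singapore law" if "Singapore" in text else "unknown")


if __name__ == "__main__":
    question = "Which law governs the agreement?"
    for budget in (2, 3):
        print(budget, run_episode(scripted_policy, question, budget))
```

In this toy setup the scripted policy needs three turns (search, read, answer), so a budget of two turns ends without an answer while a budget of three succeeds. The example only illustrates what a turn budget is; it says nothing about how the paper's agents or training procedure are built.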
