Reinforcement Learning for Long-Horizon Multi-Turn Search Agents
Vivek Kalyan, Martin Andrews

TL;DR
This paper shows that reinforcement learning substantially improves large language model agents on complex multi-turn search tasks, particularly when they can act over longer horizons. On a legal document search benchmark, the RL-trained model outperforms frontier-class models.
Contribution
The paper applies RL training to large language model agents for multi-turn search tasks and reports higher accuracy than prompt-based approaches.
Findings
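An RL-trained 14B-parameter model achieves 85% accuracy on a legal document search benchmark
Agents perform better when allowed longer multi-turn horizons, both during training and at test time
The RL-trained model outperforms frontier-class models (85% vs 78% accuracy)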
Abstract
Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14-billion-parameter model outperforms frontier-class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test time, which show that these agents achieve better results if allowed to operate over longer multi-turn horizons.
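To make the idea of a turn-restricted regime concrete, here is a minimal sketch of a multi-turn search loop with a hard turn budget. It is an illustration only, not the authors' implementation: the toy corpus, `search_tool`, `scripted_policy`, and `run_episode` are all hypothetical names, and the scripted policy stands in for an LLM. The point it illustrates is that a question needing several chained lookups cannot be answered if the budget runs out first.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy corpus: each document cites the next one in a chain.
CORPUS: Dict[str, str] = {
    "case_a": "Ruling cites case_b for the controlling precedent.",
    "case_b": "Ruling cites case_c for the statutory basis.",
    "case_c": "ANSWER: statute 42 governs the dispute.",
}

History = List[Tuple[str, str]]   # (query, observation) pairs
Action = Tuple[str, str]          # ("search", doc_id) or ("answer", text)


def search_tool(doc_id: str) -> str:
    """Toy stand-in for a search tool: return the document text or a miss."""
    return CORPUS.get(doc_id, "no result")


def scripted_policy(history: History) -> Action:
    """Toy stand-in for the LLM agent (assumption): follow citations."""
    if not history:
        return ("search", "case_a")
    _, last_obs = history[-1]
    if last_obs.startswith("ANSWER:"):
        return ("answer", last_obs[len("ANSWER:"):].strip())
    # Search the first cited document id found in the last observation.
    for token in last_obs.replace(".", " ").split():
        if token in CORPUS:
            return ("search", token)
    return ("answer", "unknown")


def run_episode(policy: Callable[[History], Action],
                max_turns: int) -> Optional[str]:
    """Run one episode; return the answer, or None if the budget runs out."""
    history: History = []
    for _ in range(max_turns):
        action, arg = policy(history)
        if action == "answer":
            return arg
        history.append((arg, search_tool(arg)))
    return None


if __name__ == "__main__":
    for budget in (2, 3, 4):
        print(budget, run_episode(scripted_policy, max_turns=budget))
```

In this toy setup the answer sits three lookups deep, so the agent needs four turns (three searches plus one answer) and returns nothing under a smaller budget. A real agent would replace `scripted_policy` with model calls, and RL training would additionally need a reward signal, which this sketch leaves out.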
