Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

Xixi Wu; Qianguo Sun; Ruiyang Zhang; Chao Song; Junlong Wu; Yiyan Qi; Hong Cheng

arXiv:2603.21972·cs.LG·March 24, 2026

TL;DR

This paper systematically studies reinforcement learning strategies for long-horizon, tool-using agents in complex multi-turn environments. It distills practical insights into a recipe that improves agent performance on the challenging TravelPlanner testbed.

Contribution

The paper offers a comprehensive empirical analysis of RL design choices for long-horizon agents, revealing key scale-dependent effects and the importance of environmental stability.

Findings

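01. Reward and algorithm choices depend on model scale: smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards.

02. About 1K training samples with a balanced difficulty mixture mark a sweet spot.

03. Environmental stability prevents policy degradation.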

Abstract

Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and…
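
The data-composition takeaway (roughly 1K training samples with a balanced difficulty mixture) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `sample_balanced`, the `difficulty` field, the three level names, and the synthetic pool are all assumptions made for illustration, and "balanced" is read here as an equal share per difficulty level.

```python
import random
from collections import defaultdict


def sample_balanced(pool, total=1000, key="difficulty", seed=0):
    """Draw up to `total` items with a near-equal share per difficulty level.

    Hypothetical helper: `pool` is a list of dicts that each carry a
    difficulty label under `key`. Nothing here comes from the paper's code.
    """
    rng = random.Random(seed)

    # Group the candidate queries by their difficulty label.
    by_level = defaultdict(list)
    for item in pool:
        by_level[item[key]].append(item)

    levels = sorted(by_level)
    if not levels:
        return []

    # Split the budget evenly; the first `extra` levels get one more item.
    base, extra = divmod(total, len(levels))

    picked = []
    for i, level in enumerate(levels):
        quota = base + (1 if i < extra else 0)
        bucket = by_level[level]
        # Sample without replacement, capped by what the bucket holds.
        picked.extend(rng.sample(bucket, min(quota, len(bucket))))

    rng.shuffle(picked)  # avoid feeding the trainer one level at a time
    return picked


# Synthetic pool with assumed difficulty labels (illustration only).
pool = [
    {"id": i, "difficulty": ("easy", "medium", "hard")[i % 3]}
    for i in range(3000)
]
train_set = sample_balanced(pool, total=1000)
```

If a level has fewer items than its quota, this sketch simply returns fewer than `total` samples rather than topping up from other levels, which keeps the mixture from silently skewing toward one difficulty.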

Code & Models

Datasets

xxwu/Agent-STAR-TravelDataset
dataset · 29 downloads

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling