RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Zihan Wang; Kangrui Wang; Qineng Wang; Pingyue Zhang; Linjie Li; Zhengyuan Yang; Xing Jin; Kefan Yu; Minh Nhat Nguyen; Licheng Liu; Eli Gottlieb; Yiping Lu; Kyunghyun Cho; Jiajun Wu; Li Fei-Fei; Lijuan Wang; Yejin Choi; Manling Li

arXiv:2504.20073 · cs.LG · May 27, 2025 · 2 cites

TL;DR

This paper introduces RAGEN, a modular framework for training large language model agents with multi-turn reinforcement learning, addressing challenges like reward variance and reasoning emergence, and providing insights into effective training strategies.

Contribution

The paper proposes StarPO, a novel trajectory-level RL framework, and RAGEN, a system for training and evaluating LLM agents, with new techniques for stabilizing training and enhancing reasoning.

Findings

01. Identifies the Echo Trap, a recurring failure mode in agent RL training.

02. Stabilization techniques such as trajectory filtering and incorporating a critic improve training.

03. RL rollouts benefit from diverse initial states and more frequent sampling.

Abstract

Training large language models (LLMs) as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on four stylized environments reveals three core findings. First, our agent RL training shows a recurring mode of Echo Trap where reward variance cliffs and gradient spikes; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and gradient stabilization. Second, we find the shaping of RL rollouts would benefit from diverse initial…
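
The abstract names trajectory filtering as one ingredient of StarPO-S, but this page does not say what is filtered or by which criterion. The sketch below is only one plausible reading, under a loudly stated assumption: rollouts are grouped by initial state, and only the states whose rollouts show the largest reward spread are kept for the update. The function name `filter_by_reward_spread`, the `keep_fraction` parameter, and the use of standard deviation as the criterion are all hypothetical and are not taken from the paper or the RAGEN codebase.

```python
import math
from statistics import pstdev
from typing import Dict, List


def filter_by_reward_spread(
    rewards_by_state: Dict[str, List[float]],
    keep_fraction: float = 0.5,
) -> List[str]:
    """Return the initial states whose rollouts have the largest reward spread.

    ASSUMPTION: ranking by standard deviation is an illustrative criterion,
    not the filtering rule confirmed by the source.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Population standard deviation of rewards across rollouts of one state.
    # States with no recorded rollouts are skipped.
    spread = {s: pstdev(r) for s, r in rewards_by_state.items() if r}
    if not spread:
        return []

    # Rank from most to least varied; sorted() is stable, so ties keep
    # their original insertion order.
    ranked = sorted(spread, key=spread.get, reverse=True)

    # Always keep at least one state so a batch is never emptied.
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:n_keep]


if __name__ == "__main__":
    # Toy rewards: four rollouts per hypothetical initial state.
    toy = {
        "state_a": [0.0, 0.0, 0.0, 0.0],  # always fails, no spread
        "state_b": [0.0, 1.0, 0.0, 1.0],  # mixed outcomes, high spread
        "state_c": [1.0, 1.0, 1.0, 1.0],  # always succeeds, no spread
        "state_d": [0.0, 0.0, 0.0, 1.0],  # occasional success
    }
    print(filter_by_reward_spread(toy, keep_fraction=0.5))
```

Under this assumption, states whose rollouts all succeed or all fail are dropped first, since identical rewards leave nothing to compare within the group.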

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning