Scaling Agent Learning via Experience Synthesis

Zhaorun Chen; Zhuokai Zhao; Kai Zhang; Bo Liu; Qi Qi; Yifan Wu; Tarun Kalluri; Sara Cao; Yuanhao Xiong; Haibo Tong; Huaxiu Yao; Hengduo Li; Jiacheng Zhu; Xian Li; Dawn Song; Bo Li; Jason Weston; Dat Huynh

arXiv:2511.03773·cs.AI·November 11, 2025

Open Access · 3 Reviews

TL;DR

DreamGym is a unified framework that synthesizes diverse, scalable experiences through reasoning-based environment modeling, significantly improving reinforcement learning training efficiency and transferability in autonomous agents.

Contribution

The paper introduces DreamGym, a novel experience synthesis framework that enables scalable, effective online RL training by modeling environment dynamics and generating diverse tasks.

Findings

1. Outperforms baselines by over 30% on WebArena.
2. Matches state-of-the-art RL performance with synthetic data.
3. Enhances sim-to-real transfer with fewer real interactions.

Abstract

While reinforcement learning (RL) can empower autonomous agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an…
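The rollout-collection idea described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not DreamGym's implementation: the names `ExperienceModel`, `ToyExperienceModel`, `collect_rollout`, and `scripted_policy` are hypothetical, and a hand-written rule table stands in for the reasoning-based model, which in the paper is an LLM that reasons step by step before emitting the next state and feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple


@dataclass
class Transition:
    state: str
    action: str
    next_state: str
    reward: float
    done: bool


class ExperienceModel(Protocol):
    """Hypothetical interface: predicts the next state and feedback signal."""

    def step(self, task: str, state: str, action: str) -> Tuple[str, float, bool]:
        ...


class ToyExperienceModel:
    """Rule-based stand-in for an LLM-backed experience model (assumption)."""

    def step(self, task: str, state: str, action: str) -> Tuple[str, float, bool]:
        if action == "submit":
            # Reward is granted only if the form was filled before submitting.
            reward = 1.0 if state == "form_filled" else 0.0
            return "finished", reward, True
        if action == "fill_form":
            return "form_filled", 0.0, False
        # Unknown actions leave the synthetic state unchanged.
        return state, 0.0, False


def collect_rollout(
    model: ExperienceModel,
    policy: Callable[[str], str],
    task: str,
    initial_state: str,
    max_steps: int = 10,
) -> List[Transition]:
    """Roll the policy forward inside the synthetic environment only."""
    state = initial_state
    trajectory: List[Transition] = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = model.step(task, state, action)
        trajectory.append(Transition(state, action, next_state, reward, done))
        state = next_state
        if done:
            break
    return trajectory


def scripted_policy(state: str) -> str:
    """Fixed policy used only to exercise the loop."""
    return "submit" if state == "form_filled" else "fill_form"


rollout = collect_rollout(
    ToyExperienceModel(), scripted_policy, "submit the form", "start"
)
total_reward = sum(t.reward for t in rollout)
```

In a real training setup, the collected transitions would be passed to an RL algorithm such as PPO or GRPO (both named in the reviews); that update step is omitted here.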

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper targets an important bottleneck in the field of autonomous agents. The prohibitive cost and low sample efficiency of online RL in complex, real-world environments (like web browsers) is a major barrier to progress. The paper's goal of creating a scalable, synthetic training environment is highly relevant and impactful. The framework's ability to enable effective RL on WebArena, an environment considered "not RL-ready", is the most compelling result. Achieving success rates over 30-45

Weaknesses

The central idea is to use an LLM as a learned world model for model-based RL. The paper cites related work like Dreamer and other LLM-based environment models but claims to be different by focusing on "policy improvement" rather than "fidelity-first". This is a weak distinction, as policy improvement is the goal of all MBRL systems, including Dreamer. The main novelties are the (effective) use of CoT reasoning in the model's SFT training and the entropy-based curriculum generator. These are goo

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. By training in synthetic environments and combining a model capable of distilling experience with experience replay and an automatic problem-generation (curriculum) mechanism, the authors reduce dependence on the real system and improve training stability and efficiency. On benchmarks such as WebArena, WebShop, and ALFWorld, the method achieves strong results, which further improve after incorporating a small amount of real data.
2. The exposition is clear: the paper’s narrative is well stru

Weaknesses

1. The work lacks experimental comparisons with representative methods in the “LLM Agents Reinforcement Learning” line of research; existing comparisons focus mainly on SFT/DPO and PPO/GRPO, making it difficult to accurately define this method’s relative advantages and applicability boundaries among peer approaches. Meanwhile, the related-work survey is neither systematic nor sufficiently comprehensive, with inadequate coverage of recent developments.
2. Although the paper claims reduced real-w

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Targets a practical yet challenging problem in LLM agent RL: reducing training cost, wall-clock time, and instability arising from expensive environment interactions.
- The paper is well-structured and clearly written: the problem is precisely defined, and each subproblem is addressed through three clearly delineated components (experience model inference, experience model training, and curriculum-based task generation). This makes the overall paper easy to follow, even for non-expert readers.

Weaknesses

- Potential error accumulation in self-training loop: Since the experience model continuously generates synthetic rollouts and refines itself using those same transitions, any early bias or inaccuracy in its learned dynamics may compound over iterations. While this may not be severe in relatively simple domains (as Figure 4 suggests small gaps between real and synthetic transitions), it could become problematic in more complex or long-horizon environments where early modeling errors propagate an

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Games