Scaling Agent Learning via Experience Synthesis
Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh

TL;DR
DreamGym is a unified framework that synthesizes diverse, scalable experiences through reasoning-based environment modeling, significantly improving reinforcement learning training efficiency and transferability in autonomous agents.
Contribution
The paper introduces DreamGym, a novel experience synthesis framework that enables scalable, effective online RL training by modeling environment dynamics and generating diverse tasks.
Findings
Outperforms baselines by over 30% on WebArena.
Matches state-of-the-art RL performance with synthetic data.
Enhances sim-to-real transfer with fewer real interactions.
Abstract
While reinforcement learning (RL) can empower autonomous agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an…
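The abstract's core idea, collecting agent rollouts from a reasoning-based experience model instead of the real environment, can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `Transition`, `toy_experience_model`, `collect_rollout`, and `scripted_policy` are illustrative, and the rule-based stub stands in for what would be an LLM that reasons step by step before emitting the next state and reward.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Transition:
    state: str
    action: str
    reasoning: str   # the model's explanation for the predicted outcome
    next_state: str
    reward: float
    done: bool


# Hypothetical stand-in for a learned experience model (assumption: in the
# paper this role is played by an LLM; here it is a hand-written rule).
def toy_experience_model(task: str, state: str, action: str) -> Tuple[str, str, float, bool]:
    """Return (reasoning, next_state, reward, done) for a toy task."""
    if action == task:
        return (f"Action '{action}' matches the task goal.", "goal_reached", 1.0, True)
    return (f"Action '{action}' does not match the goal; state is unchanged.", state, 0.0, False)


def collect_rollout(
    task: str,
    policy: Callable[[str, str, int], str],
    model: Callable[[str, str, str], Tuple[str, str, float, bool]],
    max_steps: int = 5,
) -> List[Transition]:
    """Roll out a policy against the synthetic model; no real environment is touched."""
    state = "start"
    rollout: List[Transition] = []
    for step in range(max_steps):
        action = policy(task, state, step)
        reasoning, next_state, reward, done = model(task, state, action)
        rollout.append(Transition(state, action, reasoning, next_state, reward, done))
        state = next_state
        if done:
            break
    return rollout


# Illustrative policy: one wrong action first, then the action named by the task.
def scripted_policy(task: str, state: str, step: int) -> str:
    return "click_help" if step == 0 else task


rollout = collect_rollout("click_checkout", scripted_policy, toy_experience_model)
total_reward = sum(t.reward for t in rollout)
```

The resulting list of transitions is the kind of data an RL algorithm such as PPO or GRPO would consume; the sketch omits the policy update, experience replay, and curriculum task generation entirely.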
Peer Reviews
Decision·ICLR 2026 Poster
The paper targets an important bottleneck in the field of autonomous agents. The prohibitive cost and low sample efficiency of online RL in complex, real-world environments (like web browsers) is a major barrier to progress. The paper's goal of creating a scalable, synthetic training environment is highly relevant and impactful. The framework's ability to enable effective RL on WebArena, an environment considered "not RL-ready", is the most compelling result. Achieving success rates over 30-45
The central idea is to use an LLM as a learned world model for model-based RL. The paper cites related work like Dreamer and other LLM-based environment models but claims to be different by focusing on "policy improvement" rather than "fidelity-first". This is a weak distinction, as policy improvement is the goal of all MBRL systems, including Dreamer. The main novelties are the (effective) use of CoT reasoning in the model's SFT training and the entropy-based curriculum generator. These are goo
1. By training in synthetic environments and combining a model capable of distilling experience with experience replay and an automatic problem-generation (curriculum) mechanism, the authors reduce dependence on the real system and improve training stability and efficiency. On benchmarks such as WebArena, WebShop, and ALFWorld, the method achieves strong results, which further improve after incorporating a small amount of real data.
2. The exposition is clear: the paper’s narrative is well stru
1. The work lacks experimental comparisons with representative methods in the “LLM Agents Reinforcement Learning” line of research; existing comparisons focus mainly on SFT/DPO and PPO/GRPO, making it difficult to accurately define this method’s relative advantages and applicability boundaries among peer approaches. Meanwhile, the related-work survey is neither systematic nor sufficiently comprehensive, with inadequate coverage of recent developments.
2. Although the paper claims reduced real-w
- Targets a practical yet challenging problem in LLM agent RL: reducing training cost, wall-clock time, and instability arising from expensive environment interactions.
- The paper is well-structured and clearly written: the problem is precisely defined, and each subproblem is addressed through three clearly delineated components (experience model inference, experience model training, and curriculum-based task generation). It makes the overall paper easy to follow, even for non-expert readers.
- Potential error accumulation in self-training loop: Since the experience model continuously generates synthetic rollouts and refines itself using those same transitions, any early bias or inaccuracy in its learned dynamics may compound over iterations. While this may not be severe in relatively simple domains (as Figure 4 suggests small gaps between real and synthetic transitions), it could become problematic in more complex or long-horizon environments where early modeling errors propagate an
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Artificial Intelligence in Games
