Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents
Xiao Yu, Baolin Peng, Ruize Xu, Michel Galley, Hao Cheng, Suman Nath, Jianfeng Gao, Zhou Yu

TL;DR
Dyna-Think is a framework that combines planning, reasoning, and world model simulation to improve AI agent performance, demonstrating enhanced capabilities and efficiency in complex tasks.
Contribution
It introduces Dyna-Think, a novel integration of planning, reasoning, and world modeling, with new training methods DIT and DDT to enhance AI agent effectiveness.
Findings
Dyna-Think improves in-domain and out-of-domain performance.
Critique generation enhances world model training effectiveness.
Better world modeling correlates with improved agent performance.
Abstract
Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear what behavior is effective and what behavior is missing for long-horizon AI agent tasks. In this work, we propose Dyna-Think, a thinking framework that integrates planning with an internal world model with reasoning and acting to enhance AI agent performance. To enable Dyna-Think, we propose Dyna-Think Imitation Learning (DIT) and Dyna-Think Dyna Training (DDT). To initialize a policy with Dyna-Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance…
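As a rough illustration of what "planning with an internal world model" before acting can mean, the toy sketch below simulates each candidate action with a model, picks the one whose predicted outcome is closest to the goal, and only then acts in the real environment. It is not the paper's method: Dyna-Think performs this simulation inside an LLM's thinking text, whereas everything here (the 1-D environment, `world_model`, `choose_action`, `run_episode`) is a hypothetical stand-in chosen for brevity.

```python
from typing import Callable, List

State = int
Action = int


def true_env_step(state: State, action: Action) -> State:
    """Real environment (assumed toy): move along a line clamped to [0, 10]."""
    return max(0, min(10, state + action))


def world_model(state: State, action: Action) -> State:
    """Hypothetical internal model; here it happens to match the environment."""
    return max(0, min(10, state + action))


def choose_action(
    state: State,
    goal: State,
    actions: List[Action],
    model: Callable[[State, Action], State],
) -> Action:
    # Simulate every candidate action with the model and keep the one
    # whose predicted next state lies closest to the goal.
    return min(actions, key=lambda a: abs(model(state, a) - goal))


def run_episode(start: State, goal: State, max_steps: int = 20) -> List[State]:
    state = start
    trajectory = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        # Plan with the model first, then act in the real environment.
        action = choose_action(state, goal, [-1, 1], world_model)
        state = true_env_step(state, action)
        trajectory.append(state)
    return trajectory


print(run_episode(2, 5))  # [2, 3, 4, 5]
```

The separation between `world_model` and `true_env_step` is the point of the sketch: the agent commits to an action only after checking its predicted consequence, and a less accurate model would lead to worse choices.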
Peer Reviews
Decision · Submitted to ICLR 2026
- This work is generally well-written and presented cleanly.
- The idea of integrating world model simulation into the CoT process is interesting, and various variants of such an integration have been explored and compared.
- The experiments are extensive, with useful ablations and analyses.
- Why DDT > vanilla Dyna is under-explained. The authors show DDT (especially next-state and critique prediction) > vanilla Dyna under the same rollout budget (Table 1), but do not disentangle why world modelling helps in DDT and not in vanilla Dyna. Can the authors specify how the "separate $W(\mu)$" is trained/used in vanilla Dyna (e.g., model size, training signals, fidelity checks, rollout usage)?
- The authors say added evaluation hints are removed during training and testing in Sec 5.2, so
1. The paper aims to tackle a timely topic as both language agents and world models have shown promise in their own respective domains, but they have not been super successfully combined so far.
2. I like that the paper measures both in-domain and out-of-domain performance.
3. The results tend to suggest some of the proposed modifications indeed help performance over the trained baselines.
4. Figure 4 seems like a nice result since it implies you can continue learning mostly (or purely?) through
1. Section 3.2: Based on Table A4 in the appendix, I’m not convinced that the benefits seen from doing imitation learning on R1 thinking reconstructions have anything to do with world model simulation. Specifically, did the authors try simpler prompts than the one listed in Table A4, maybe just keeping “remove unnecessary thinking parts without touching anything else”? In other words, how much of the benefit is just coming from generally letting GPT-4o clean up the thinking trace?
2. Figure 2
The work tackles the critical trade-off in AI agent design by proposing a framework to distill the capabilities of large expert models into smaller, task-optimised agents. This approach positions itself as a middle ground, mediating between the high computational cost of test-time search (e.g., MCTS) and the high token cost of the elaborate reasoning found in leading thinking models. The authors have employed a solid evaluation setup by using held-out in-domain test sets, completely out-of-d
The claim of achieving performance similar to the R1 expert warrants a more nuanced analysis, as the aggregate metrics conceal significant weaknesses in generalization and reliability. For instance, the model's reliability drops precipitously in unfamiliar contexts: its Average Success Rate falls from 30.3% on in-domain tasks to just 17.6% out-of-domain. This suggests that the agent's learned 'world model' is highly specialized to the training applications and does not transfer effectively. This
Taxonomy
Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Reinforcement Learning in Robotics
