Current Agents Fail to Leverage World Model as Tool for Foresight
Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji

TL;DR
Current vision-language agents underutilize generative world models for foresight, often misusing or ignoring simulation capabilities, highlighting the need for better strategies to leverage these models for improved anticipatory reasoning.
Contribution
This paper empirically evaluates how current agents leverage generative world models, revealing significant underuse and misuse, and identifies key bottlenecks in their strategic utilization.
Findings
Some agents rarely invoke simulation (fewer than 1%)
Agents frequently misuse predicted rollouts (approximately 15%)
Performance is often inconsistent or even degraded (by up to 5%) when simulation is available or enforced
Abstract
Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight…
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Social Robot Interaction and HRI
