Benchmarking World-Model Learning with Environment-Level Queries
Archana Warrier, Dat Nguyen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, Zenna Tavares

TL;DR
This paper introduces WorldTest, a new evaluation protocol for assessing whether AI agents learn comprehensive world models capable of answering diverse environment-level questions, highlighting gaps between human and machine understanding.
Contribution
The paper proposes WorldTest as a novel benchmarking framework and instantiates it as AutumnBench, enabling evaluation of world models on environment-level queries in grid-world environments.
Findings
Humans outperform frontier models on environment-level queries.
AutumnBench includes 43 environments and 129 tasks for diverse question types.
Models show limited generality compared to human understanding.
Abstract
World models are central to building AI agents capable of flexible reasoning and planning. Yet current evaluations (i) test only properties measurable from observed interactions, such as next-frame prediction or task return, and (ii) do not test whether a learned model supports diverse queries about the environment. In contrast, humans build models that can answer many different questions about an environment, including questions that require understanding global structure and counterfactual consequences. We propose WorldTest: a protocol for evaluating whether agents learn models that support multiple questions whose answers depend on properties of the full environment, not just observed trajectories. Individually, these queries can target properties (e.g., reachability or the…
