Benchmarking World-Model Learning with Environment-Level Queries

Archana Warrier; Dat Nguyen; Michelangelo Naim; Moksh Jain; Yichao Liang; Karen Schroeder; Cambridge Yang; Joshua B. Tenenbaum; Sebastian Vollmer; Kevin Ellis; Zenna Tavares

arXiv:2510.19788·cs.AI·May 11, 2026

TL;DR

This paper introduces WorldTest, a new evaluation protocol for assessing whether AI agents learn comprehensive world models capable of answering diverse environment-level questions, highlighting gaps between human and machine understanding.

Contribution

The paper proposes WorldTest as a novel benchmarking framework and instantiates it as AutumnBench, enabling evaluation of world models on environment-level queries in grid-world environments.

Findings

1. Humans outperform frontier models on environment-level queries.

2. AutumnBench includes 43 environments and 129 tasks for diverse question types.

3. Models show limited generality compared to human understanding.

Abstract

World models are central to building AI agents capable of flexible reasoning and planning. Yet current evaluations (i) test only properties measurable from observed interactions, such as next-frame prediction or task return, and (ii) do not test whether a learned model supports diverse queries about the environment. In contrast, humans build general-purpose models that can answer many different questions about an environment, including questions that require understanding global structure and counterfactual consequences. We propose WorldTest: a protocol for evaluating whether agents learn models that support multiple environment-level queries, that is, questions whose answers depend on properties of the full environment, not just observed trajectories. Individually, these queries can target properties (e.g., reachability or the…
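To make the distinction between trajectory-level and environment-level answers concrete, here is a minimal sketch of a reachability query on a toy grid world. The grid layout, the `reachable` helper, and the observed trajectory are all hypothetical and are not taken from the paper or from AutumnBench; the sketch only illustrates that the answer to such a query depends on the full environment rather than on the cells an agent happened to visit.

```python
from collections import deque

# Hypothetical 5x5 grid world: '#' is a wall, '.' is free space.
# The only gap in the wall is in row 3.
GRID = [
    "..#..",
    "..#..",
    "..#..",
    ".....",
    "..#..",
]


def reachable(grid, start):
    """Return the set of free cells reachable from `start` using
    4-neighbour moves (breadth-first search over the full grid)."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


# Hypothetical observed trajectory that stays in the top-left corner.
observed = [(0, 0), (0, 1), (1, 1), (1, 0)]
goal = (0, 4)

# Trajectory-level answer: was the goal ever visited?
in_trajectory = goal in set(observed)

# Environment-level answer: can the goal be reached at all?
in_environment = goal in reachable(GRID, (0, 0))
```

In this toy case the goal never appears in the observed trajectory, yet it is reachable through the gap in the wall, so answering correctly requires a model of the whole grid.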
