CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning
Wenxin Ma, Chenlong Wang, Ruisheng Yuan, Hao Chen, Nanru Dai, S. Kevin Zhou, Yijun Yang, Alan Yuille, Jieneng Chen

TL;DR
CausalSpatial introduces a benchmark to evaluate object-centric causal spatial reasoning in models, revealing significant gaps in current multimodal models' ability to predict consequences in 3D scenes, and proposes a new framework to improve this reasoning.
Contribution
The paper presents CausalSpatial, a new benchmark for causal spatial reasoning, and introduces the COW model, which externalizes simulation to ground the model's reasoning in physical reality.
Findings
Humans score 84% on the benchmark, while GPT-5 scores only 54%.
Current models rely heavily on textual reasoning, leading to ungrounded hallucinations.
COW improves reasoning by generating videos of hypothetical dynamics.
Abstract
Humans can look at a static scene and instantly predict what happens next -- will moving this object cause a collision? We call this ability Causal Spatial Reasoning. However, current multimodal large language models (MLLMs) cannot do this, as they remain largely restricted to static spatial perception, struggling to answer "what-if" questions in a 3D scene. We introduce CausalSpatial, a diagnostic benchmark evaluating whether models can anticipate consequences of object motions across four tasks: Collision, Compatibility, Occlusion, and Trajectory. Results expose a severe gap: humans score 84% while GPT-5 achieves only 54%. Why do MLLMs fail? Our analysis uncovers a fundamental deficiency: models over-rely on textual chain-of-thought reasoning that drifts from visual evidence, producing fluent but spatially ungrounded hallucinations. To address this, we propose the Causal Object World…
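As a rough illustration of how a benchmark organized around the four named tasks might be scored, the sketch below computes per-task accuracy and the human-model gap reported above. The record layout, the `per_task_accuracy` helper, and the toy answers are all hypothetical and do not come from the paper. Only the task names and the 84% and 54% figures are taken from the abstract.

```python
from collections import defaultdict

# Task names as listed in the abstract.
TASKS = ("Collision", "Compatibility", "Occlusion", "Trajectory")


def per_task_accuracy(records):
    """Return {task: accuracy} for an iterable of (task, predicted, gold) tuples.

    The tuple layout is an assumption made for this sketch, not the
    benchmark's actual data format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    # Skip tasks with no records to avoid dividing by zero.
    return {task: correct[task] / total[task] for task in TASKS if total[task]}


# Made-up toy records, purely for illustration.
toy_records = [
    ("Collision", "yes", "yes"),
    ("Collision", "no", "yes"),
    ("Compatibility", "A", "A"),
    ("Occlusion", "B", "C"),
    ("Trajectory", "left", "left"),
    ("Trajectory", "left", "left"),
]

scores = per_task_accuracy(toy_records)

# Gap between the reported human (84%) and GPT-5 (54%) scores, in percentage points.
human_model_gap = 84 - 54
```

Reporting accuracy per task rather than as a single pooled number makes it easier to see which kind of "what-if" question a model struggles with.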
Taxonomy
Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Autonomous Vehicle Technology and Safety
