TL;DR
This paper introduces Spatial Causal Prediction (SCP), a new task and benchmark for evaluating models' ability to infer unseen spatial causal states in videos, highlighting current limitations and proposing strategies for improvement.
Contribution
The paper defines SCP as a novel task, builds the SCP-Bench benchmark of 2,500 QA pairs across 1,181 videos, and evaluates 23 models to identify performance gaps and guide future research.
Findings
Models show significant performance gaps compared to humans.
Models show limited ability to extrapolate temporally and infer causality.
Proposed perception and reasoning strategies improve spatial causal understanding.
Abstract
Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and…
