Vision-language models lag human performance on physical dynamics and intent reasoning
Tianjun Gu, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan, Athanasios V

TL;DR
This paper introduces Teleo-Spatial Intelligence (TSI), a reasoning capability, and EscherVerse, a large-scale dataset for evaluating and improving how vision-language models reason about physical dynamics and human intent. Evaluation reveals a significant gap relative to human performance.
Contribution
The paper defines TSI, a reasoning capability that links spatiotemporal change to goal-directed structure, and presents EscherVerse, a new dataset for evaluating it. Together they expose the limits of current models in understanding physical interactions and human intent.
Findings
State-of-the-art models achieve around 57% accuracy, below human performance of over 90%.
Fine-tuning on real-world data reduces but does not eliminate the reasoning gap.
EscherVerse serves as a diagnostic tool for spatial reasoning in embodied AI.
Abstract
Spatial intelligence is central to embodied cognition, yet contemporary AI systems still struggle to reason about physical interactions in open-world human environments. Despite strong performance on controlled benchmarks, vision-language models often fail to jointly model physical dynamics, reference frames, and the latent human intentions that drive spatial change. We introduce Teleo-Spatial Intelligence (TSI), a reasoning capability that links spatiotemporal change to goal-directed structure. To evaluate TSI, we present EscherVerse, a large-scale open-world resource built from 11,328 real-world videos, including an 8,000-example benchmark and a 35,963-example instruction-tuning set. Across 27 state-of-the-art vision-language models and an independent analysis of first-pass human responses from 11 annotators, we identify a persistent teleo-spatial reasoning gap: the strongest…
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
