Vision-language models lag human performance on physical dynamics and intent reasoning

Tianjun Gu; Jingyu Gong; Zhizhong Zhang; Yuan Xie; Lizhuang Ma; Xin Tan; Athanasios V

arXiv:2601.01547·cs.CV·March 24, 2026



Open Access · 1 dataset

TL;DR

This paper introduces Teleo-Spatial Intelligence (TSI), a reasoning capability that links spatiotemporal change to goal-directed structure, and EscherVerse, a large-scale dataset for evaluating and improving how vision-language models reason about physical dynamics and human intent. The evaluation reveals a significant gap relative to human performance.

Contribution

The paper defines TSI as a reasoning capability and presents EscherVerse, a resource built from 11,328 real-world videos that comprises an 8,000-example benchmark and a 35,963-example instruction-tuning set. Together they expose the limitations of current models in understanding physical interactions and intent.

Findings

01. State-of-the-art models achieve around 57% accuracy, below human performance of over 90%.
02. Fine-tuning on real-world data reduces but does not eliminate the reasoning gap.
03. EscherVerse serves as a diagnostic tool for spatial reasoning in embodied AI.
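
To make the size of the gap in the first finding concrete, the sketch below computes accuracy and a human-minus-model gap in percentage points. It is an illustration only: exact-match scoring is an assumption (this page does not describe the benchmark's answer format or scoring protocol), the helper names `accuracy` and `gap_in_points` are hypothetical, the toy answers are invented, and 0.57 and 0.90 are the rounded figures quoted above rather than exact results.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference answer.

    Exact-match scoring is an assumption made for this sketch.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def gap_in_points(model_acc, human_acc):
    """Human-minus-model accuracy gap, expressed in percentage points."""
    return round((human_acc - model_acc) * 100, 1)


# Hypothetical toy answers, not drawn from EscherVerse.
refs = ["A", "C", "B", "D"]
preds = ["A", "C", "D", "A"]
toy_acc = accuracy(preds, refs)  # 2 of 4 answers match

# Rounded figures from the findings: "around 57%" and "over 90%".
reported_gap = gap_in_points(0.57, 0.90)

print(f"toy accuracy: {toy_acc:.2f}")
print(f"approximate gap: {reported_gap} percentage points")
```

Because the human figure is stated as "over 90%", the value computed from the rounded numbers is best read as an approximate lower bound on the gap.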

Abstract

Spatial intelligence is central to embodied cognition, yet contemporary AI systems still struggle to reason about physical interactions in open-world human environments. Despite strong performance on controlled benchmarks, vision-language models often fail to jointly model physical dynamics, reference frames, and the latent human intentions that drive spatial change. We introduce Teleo-Spatial Intelligence (TSI), a reasoning capability that links spatiotemporal change to goal-directed structure. To evaluate TSI, we present EscherVerse, a large-scale open-world resource built from 11,328 real-world videos, including an 8,000-example benchmark and a 35,963-example instruction-tuning set. Across 27 state-of-the-art vision-language models and an independent analysis of first-pass human responses from 11 annotators, we identify a persistent teleo-spatial reasoning gap: the strongest…


Code & Models

Datasets

Gradygu3u/EscherVerse-Data
dataset · 13 dl


Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization