Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang

TL;DR
Jigsaw-Puzzles is a new benchmark with 1,100 complex images designed to evaluate and challenge the spatial reasoning and structural understanding capabilities of vision-language models, revealing significant performance gaps compared to humans.
Contribution
The paper introduces Jigsaw-Puzzles, a benchmark of 1,100 real-world images with five tasks that probe spatial perception, structural understanding, and reasoning in vision-language models, and uses it to evaluate 24 state-of-the-art VLMs.
Findings
Even state-of-the-art VLMs reach at most 77.14% overall accuracy on the benchmark.
Accuracy on the Order Generation task is especially low, at only 30%.
There is a significant gap between model performance and human performance, emphasizing the need for further research.
Abstract
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show…
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
