Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models

Zesen Lyu; Dandan Zhang; Wei Ye; Fangdi Li; Zhihang Jiang; Yao Yang

arXiv:2505.20728 · cs.AI · August 27, 2025

TL;DR

Jigsaw-Puzzles is a new benchmark with 1,100 complex images designed to evaluate and challenge the spatial reasoning and structural understanding capabilities of vision-language models, revealing significant performance gaps compared to humans.

Contribution

The paper introduces Jigsaw-Puzzles, a novel benchmark dataset and evaluation framework specifically targeting spatial reasoning in vision-language models, highlighting current limitations.

Findings

01

Even the strongest evaluated VLM reaches only 77.14% overall accuracy on the benchmark.

02

Models perform poorly on the Order Generation task, with only 30% accuracy.

03

There is a significant gap between model performance and human performance, emphasizing the need for further research.

Abstract

Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show…

Code & Models

Datasets

zesen01/Jigsaw-Puzzles
dataset · 298 downloads

Videos

Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization