Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Zengbin Wang; Xuecai Hu; Yong Wang; Feng Xiong; Man Zhang; Xiangxiang Chu

arXiv:2601.20354·cs.CV·January 30, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces SpatialGenEval, a comprehensive benchmark for evaluating the spatial reasoning capabilities of text-to-image models, revealing current limitations and proposing a data-centric approach to improve spatial understanding.

Contribution

The paper presents a new benchmark with dense prompts and a related dataset to systematically assess and enhance the spatial reasoning abilities of T2I models.

Findings

01

Higher-order spatial reasoning is a major bottleneck.

02

Fine-tuning improves spatial relation accuracy by around 4-6%.

03

Information-dense prompts lead to more realistic spatial images.

Abstract

Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail in handling complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short or information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects: (1) SpatialGenEval involves 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and corresponding 10 multi-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To…
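The abstract describes scoring each generated image against 10 multiple-choice questions, one per spatial sub-domain. As a rough illustration only, the sketch below shows one way such answers might be aggregated into per-sub-domain accuracy. The function names, the record layout, the sub-domain labels, and the choice of an unweighted mean across sub-domains are all assumptions made here for illustration, not the authors' evaluation code.

```python
from collections import defaultdict

# Figures from the abstract: 1,230 prompts, each with 10 question-answer pairs.
NUM_PROMPTS = 1230
QUESTIONS_PER_PROMPT = 10
TOTAL_QUESTIONS = NUM_PROMPTS * QUESTIONS_PER_PROMPT


def subdomain_accuracy(records):
    """Return accuracy per sub-domain.

    `records` is a hypothetical iterable of
    (subdomain, predicted_choice, correct_choice) tuples, where the
    predicted choice would come from a VLM answering about the image.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subdomain, predicted, correct in records:
        totals[subdomain] += 1
        hits[subdomain] += int(predicted == correct)
    return {name: hits[name] / totals[name] for name in totals}


def overall_accuracy(per_subdomain):
    """Unweighted mean over sub-domains (an assumed aggregation rule)."""
    return sum(per_subdomain.values()) / len(per_subdomain)


# Toy answers; the sub-domain labels are illustrative placeholders.
records = [
    ("position", "A", "A"),
    ("position", "B", "A"),
    ("occlusion", "C", "C"),
    ("causality", "D", "B"),
]
scores = subdomain_accuracy(records)
overall = overall_accuracy(scores)
```

Reporting accuracy per sub-domain rather than a single pooled number is what would let a benchmark of this shape separate basic perception from higher-order reasoning failures.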

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The work introduces previously untested axes of T2I capability, and provides a large benchmark designed to test models along these new dimensions. The benchmark consists of LLM-generated prompts with automated model-based evaluation, where the generation prompts and evaluation QA pairs are refined manually by humans. While the integrity of the model-based evals could be contested, the diversity of prompts and structured scoring (evenly across the clearly defined 10 sub-domains) is clear. The be

Weaknesses

In the analysis of error rates and failure cases, the authors state: "This clear hierarchy demonstrates that models learn skills in a specific order". However, the hierarchy is not clear from Figure 5; it seems, for example, that motion, which appears after relational reasoning, is already easier. Some clarification would be helpful here.

It is unclear how impactful the contribution of the training data is. While the results show that SFT on the new data improves performance on their benchmark,

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper introduces a new benchmark with complex and information-dense prompts, which significantly improve upon previous benchmarks that rely on shorter prompts and a limited number of spatial relations.
- The paper also introduces a new evaluation protocol that incorporates a vision-language model (VLM) to assess the completeness of generated images through visual question answering, specifically designed to evaluate spatial understanding.
- The benchmark includes human experts in refining pr

Weaknesses

• Although the paper proposes a benchmark with complex spatial relations, the evaluation protocol raises some concerns. As demonstrated by several prior works, VLMs still struggle with spatial understanding, particularly in 3D spatial relations [1, 2, 3]. Therefore, the interpretation of the results is limited, as it remains unclear whether the observed issues stem from the T2I model itself or from the evaluation protocol.
• Although the authors demonstrate that fine-tuning improves performance, t

Reviewer 03 · Rating 2 · Confidence 5

Strengths

* Well-motivated problem statement
* Diverse evaluation dataset
* Well-guiding principles behind dataset construction
* Diverse results from many different models

Weaknesses

* Unclarity in writing
* Lack of QA in MLLM-synthesized prompts
* Emphasis on using 77 tokens (L259) despite the fact that modern models go well beyond this context length

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques