Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation

Wei Chen; Yancheng Long; Mingqiao Liu; Haojie Ding; Yankai Yang; Hongyang Wei; Yi-Fan Zhang; Bin Wen; Fan Yang; Tingting Gao; Han Li; Long Chen

arXiv:2602.11980 · cs.CV · February 13, 2026

Open Access

TL;DR

The paper introduces Spatial Chain-of-Thought (SCoT), a plug-and-play framework that enhances diffusion models with spatial reasoning by integrating MLLMs as planners, leading to improved image generation and editing capabilities.

Contribution

The paper proposes SCoT, a framework that couples MLLMs with diffusion models for spatial reasoning while avoiding both the high computational cost of joint training and the spatial information loss of text-only prompting.

Findings

01. Achieves state-of-the-art results on image generation benchmarks.

02. Outperforms baselines on complex spatial reasoning tasks.

03. Proves effective in image editing scenarios.

Abstract

While diffusion models have shown exceptional capabilities in aesthetic image synthesis, they often struggle with complex spatial understanding and reasoning. Existing approaches resort to Multimodal Large Language Models (MLLMs) to enhance this capability. However, they either incur high computational costs through joint training or suffer from spatial information loss when relying solely on textual prompts. To alleviate these limitations, we propose a Spatial Chain-of-Thought (SCoT) framework, a plug-and-play approach that effectively bridges the reasoning capabilities of MLLMs with the generative power of diffusion models. Specifically, we first enhance the diffusion model's layout awareness by training it on an interleaved text-coordinate instruction format. We then leverage state-of-the-art MLLMs as planners to generate comprehensive layout plans, transferring their spatial…
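To make the idea of an interleaved text-coordinate instruction more concrete, the following is a minimal sketch of how a layout plan might be serialized into a single prompt string. The abstract does not specify the actual format, so everything here is an assumption for illustration: the `LayoutItem` class, the `<box>` tag syntax, the normalized `(x0, y0, x1, y1)` convention, and the `interleave` helper are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LayoutItem:
    """One planned object: a text phrase plus a bounding box (hypothetical schema)."""
    phrase: str
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed normalized to [0, 1]


def validate(item: LayoutItem) -> None:
    """Reject boxes that fall outside the canvas or have non-positive area."""
    x0, y0, x1, y1 = item.box
    if not all(0.0 <= v <= 1.0 for v in item.box):
        raise ValueError(f"box out of range for {item.phrase!r}: {item.box}")
    if x0 >= x1 or y0 >= y1:
        raise ValueError(f"degenerate box for {item.phrase!r}: {item.box}")


def interleave(items: List[LayoutItem]) -> str:
    """Serialize a plan so each phrase is followed directly by its coordinates.

    The <box>...</box> tag is an illustrative placeholder, not the paper's format.
    """
    parts = []
    for item in items:
        validate(item)
        x0, y0, x1, y1 = item.box
        parts.append(f"{item.phrase} <box>{x0:.2f},{y0:.2f},{x1:.2f},{y1:.2f}</box>")
    return "; ".join(parts)


# A plan of the kind an MLLM planner could emit for "an apple to the left of a cup".
plan = [
    LayoutItem("a red apple", (0.10, 0.40, 0.35, 0.70)),
    LayoutItem("a blue cup", (0.60, 0.35, 0.85, 0.75)),
]
prompt = interleave(plan)
print(prompt)
```

In a pipeline of the kind the abstract describes, a string like `prompt` would be produced by the MLLM planner and consumed by the layout-aware diffusion model. Keeping coordinates next to the phrases they describe is one plausible way to avoid the spatial information loss that a purely textual prompt would incur.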

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Aesthetic Perception and Analysis