SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

Yian Li; Yang Jiao; Bin Zhu; Tianwen Qian; Shaoxiang Chen; Jingjing Chen; Yu-Gang Jiang

arXiv:2604.17385 · cs.CV · April 21, 2026

TL;DR

SpatialImaginer is a framework that improves spatial reasoning in multimodal large language models by integrating textual planning with visual imagination, making spatial state tracking more robust in multi-step tasks.

Contribution

The paper introduces a unified multimodal generation approach that combines textual reasoning with visual imagination, supported by a difficulty-aware data engine intended to improve spatial state preservation.

Findings

01. Achieves state-of-the-art performance on spatial reasoning benchmarks.

02. Substantially improves robustness in multi-step spatial tasks.

03. Effectively preserves geometric structures during reasoning.

Abstract

Spatial intelligence, the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models (MLLMs). Despite promising performance, recent MLLMs often exhibit fragile reasoning traces in spatial intelligence tasks that require consistent spatial state recognition. We argue that these failures stem from a mismatch between the spatial recognition mechanism and the text-only reasoning behavior of these MLLMs. Effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process, whereas textual representations tend to abstract away precisely these critical details. To address this issue, we propose SpatialImaginer, a unified multimodal generation framework that integrates textual reasoning with visual…
