SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
Yian Li, Yang Jiao, Bin Zhu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang

TL;DR
SpatialImaginer is a novel framework that enhances spatial reasoning in multimodal models by integrating textual planning with visual imagination, leading to improved robustness and state tracking in complex tasks.
Contribution
The paper introduces a unified multimodal generation approach that combines textual reasoning with visual imagination, supported by a difficulty-aware data engine for better spatial state preservation.
Findings
Achieves state-of-the-art performance on spatial reasoning benchmarks.
Substantially improves robustness in multi-step spatial tasks.
Effectively preserves geometric structures during reasoning.
Abstract
Spatial intelligence, which refers to the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models (MLLMs). Despite promising performance, recent MLLMs often exhibit fragile reasoning traces in spatial intelligence tasks that involve consistent spatial state recognition. We argue that these failures stem from a mismatch between the spatial recognition mechanism and the text-only reasoning behavior of these MLLMs. Effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process, whereas textual representations tend to abstract away precisely these critical details. To address this issue, we propose SpatialImaginer, a unified multimodal generation framework that integrates textual reasoning with visual…
