SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

Haiyi Qiu; Kaihang Pan; Jiacheng Li; Juncheng Li; Siliang Tang; Yueting Zhuang

arXiv:2604.26341 · cs.CV · April 30, 2026

TL;DR

SpatialFusion is a framework that builds 3D geometric understanding into unified image generation models, improving the spatial awareness and coherence of generated images.

Contribution

SpatialFusion uses a Mixture-of-Transformers (MoT) architecture that pairs the MLLM with a spatial transformer, plus a depth adapter that incorporates explicit geometric guidance into diffusion-based image generation.

Findings

01. Outperforms leading models such as GPT-4o on spatial benchmarks.

02. Improves performance on text-to-image generation and image editing tasks.

03. Adds only negligible inference overhead.

Abstract

Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone…
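
The shared self-attention idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Branch` and `shared_self_attention`, the single-head formulation, the token counts, and the random weights are all assumptions made for illustration. It shows only the general Mixture-of-Transformers pattern, in which each token stream keeps its own projection weights while attention is computed jointly over both streams, so that spatial tokens can read from semantic context.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class Branch:
    """Projection weights owned by one token stream (hypothetical helper)."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        self.wo = rng.normal(0.0, scale, (dim, dim))


def shared_self_attention(semantic_tokens, spatial_tokens, semantic_branch, spatial_branch):
    """Single-head joint attention over two streams with separate weights.

    Assumption: both streams share the same hidden size, and no attention
    mask is applied. The source does not specify these details.
    """
    streams = [(semantic_tokens, semantic_branch), (spatial_tokens, spatial_branch)]
    # Each stream is projected with its own parameters, then concatenated.
    q = np.concatenate([x @ b.wq for x, b in streams], axis=0)
    k = np.concatenate([x @ b.wk for x, b in streams], axis=0)
    v = np.concatenate([x @ b.wv for x, b in streams], axis=0)

    # One attention map over the union of tokens: this is the shared part.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    mixed = attn @ v

    # Split back and apply each stream's own output projection.
    n_sem = semantic_tokens.shape[0]
    out_semantic = mixed[:n_sem] @ semantic_branch.wo
    out_spatial = mixed[n_sem:] @ spatial_branch.wo
    return out_semantic, out_spatial, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    semantic = rng.normal(size=(4, dim))  # stand-in for MLLM tokens
    spatial = rng.normal(size=(6, dim))   # stand-in for spatial (depth) tokens
    out_sem, out_spa, attn = shared_self_attention(
        semantic, spatial, Branch(dim, rng), Branch(dim, rng)
    )
    print(out_sem.shape, out_spa.shape, attn.shape)
```

Because the attention map spans both streams, changing the semantic tokens changes the spatial outputs even though the two branches share no weights. How the paper masks, scales, or stacks these layers is not stated in the excerpt above.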
