Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

Hyundo Lee; Suhyung Choi; Inwoo Hwang; Byoung-Tak Zhang

arXiv:2508.10382 · cs.CV · November 27, 2025

TL;DR

This paper introduces a diffusion-based approach that co-generates images together with their intrinsic scene properties, such as depth and segmentation maps. The result is more spatially consistent and realistic images, without requiring scene information beyond what pre-trained estimators extract from the training images.

Contribution

Rather than relying solely on image-text pairs or using intrinsics as conditional inputs, the method co-generates images and their scene intrinsics within a diffusion model, which improves the spatial consistency of the generated images.

Findings

01. Reduces spatial inconsistencies in generated images
02. Maintains high image fidelity and textual alignment
03. Enhances scene realism and natural layout

Abstract

Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then…
