MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

Zehuan Huang; Yuan-Chen Guo; Xingqiao An; Yunhan Yang; Yangguang Li; Zi-Xin Zou; Ding Liang; Xihui Liu; Yan-Pei Cao; Lu Sheng

arXiv:2412.03558·cs.CV·July 18, 2025

TL;DR

MIDI introduces a multi-instance diffusion approach that generates a detailed 3D scene from a single image, producing multiple objects together with accurate spatial relationships while generalizing well.

Contribution

MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models and adds a novel multi-instance attention mechanism, so that all objects in a 3D scene are generated simultaneously.

Findings

01. Achieves state-of-the-art performance on synthetic and real-world data.

02. Effectively models inter-object interactions with limited scene-level supervision.

03. Maintains pre-trained generalization through combined training on scene and single-object data.
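
The combined training mentioned in the last finding can be pictured with a small sampling sketch. This is only an illustration under stated assumptions, not the authors' data pipeline: the function name `sample_training_example`, the mixing probability `p_scene`, and the choice to wrap a single object as a one-instance scene are all hypothetical.

```python
import random


def sample_training_example(scene_data, object_data, p_scene, rng):
    """Draw one training example (hypothetical mixing scheme).

    scene_data:  list of scenes, each scene being a list of object instances.
    object_data: list of standalone object instances.
    p_scene:     assumed probability of drawing a multi-object scene.
    """
    if rng.random() < p_scene:
        # Scene-level example: several instances that interact spatially.
        return rng.choice(scene_data)
    # Single-object example, wrapped as a one-instance "scene" so both
    # kinds of data share the same format.
    return [rng.choice(object_data)]


rng = random.Random(0)
scenes = [["chair", "table"], ["sofa", "lamp", "rug"]]
objects = ["mug", "vase"]
example = sample_training_example(scenes, objects, p_scene=0.5, rng=rng)
```

Keeping single-object examples in the mix is one plausible way to keep the model close to its pre-trained object prior while it learns scene-level interactions; the ratio used here is arbitrary.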

Abstract

This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the…
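
The idea behind multi-instance attention can be illustrated with a minimal NumPy sketch. It assumes one plausible reading of the abstract: each object instance keeps its own queries, while keys and values are pooled over the tokens of every instance in the scene. The function names, tensor layout, and single-head form are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_instance_attention(tokens, w_q, w_k, w_v):
    """Single-head attention across all instances of one scene (sketch).

    tokens: (num_instances, num_tokens, dim) latent tokens, one slice per object.
    w_q, w_k, w_v: (dim, dim) projection matrices (hypothetical parameters).
    """
    n, t, d = tokens.shape
    q = tokens @ w_q                          # queries stay per instance: (n, t, d)
    k = (tokens @ w_k).reshape(n * t, d)      # keys pooled over all instances
    v = (tokens @ w_v).reshape(n * t, d)      # values pooled over all instances
    scores = q @ k.T / np.sqrt(d)             # (n, t, n * t)
    weights = softmax(scores, axis=-1)        # each token attends to the whole scene
    return weights @ v, weights               # output: (n, t, d)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4, 8))           # 3 objects, 4 tokens each, dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = multi_instance_attention(tokens, w_q, w_k, w_v)
```

With a single instance this reduces to ordinary self-attention over that object's tokens; with several instances, every object's output depends on the others, which is how inter-object interactions could enter the generation process in one pass.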

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Advanced Vision and Imaging · Music Technology and Sound Studies

Methods: Softmax · Attention Is All You Need · Diffusion