MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, Lu Sheng

TL;DR
MIDI introduces a multi-instance diffusion approach for generating detailed 3D scenes from a single image, capturing multiple objects and their spatial relationships with high accuracy and generalization.
Contribution
It extends pre-trained image-to-3D models to multi-instance diffusion, incorporating a novel attention mechanism for simultaneous multi-object 3D scene generation.
Findings
Achieves state-of-the-art performance on synthetic and real-world data.
Effectively models inter-object interactions with limited scene-level supervision.
Maintains pre-trained generalization through combined training on scene and single-object data.
Abstract
This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques, or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method takes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the…
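To make the idea of multi-instance attention concrete, here is a minimal NumPy sketch, not MIDI's actual implementation. It assumes each object instance is represented by a fixed number of latent tokens, and that "multi-instance" attention means each instance's queries attend over the keys and values of every instance in the scene rather than only its own. The function name `multi_instance_attention`, the single-head layout, and the projection matrices `w_q`, `w_k`, `w_v` are illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_instance_attention(tokens, w_q, w_k, w_v):
    """Single-head attention where every instance attends to all instances.

    tokens: array of shape (num_instances, tokens_per_instance, dim)
    w_q, w_k, w_v: hypothetical projection matrices of shape (dim, dim)
    Returns (output, weights) with shapes
    (num_instances, tokens_per_instance, dim) and
    (num_instances, tokens_per_instance, num_instances * tokens_per_instance).
    """
    n, t, d = tokens.shape

    # Queries stay grouped per instance.
    q = tokens @ w_q                      # (n, t, d)

    # Keys and values are pooled across all instances in the scene,
    # so each query can see tokens belonging to other objects.
    flat = tokens.reshape(n * t, d)
    k = flat @ w_k                        # (n * t, d)
    v = flat @ w_v                        # (n * t, d)

    scores = q @ k.T / np.sqrt(d)         # (n, t, n * t)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights           # (n, t, d), (n, t, n * t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = rng.normal(size=(3, 4, 8))    # 3 instances, 4 tokens each, dim 8
    w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out, attn = multi_instance_attention(scene, w_q, w_k, w_v)
    print(out.shape, attn.shape)
```

With a single instance this reduces to ordinary self-attention; with several, changing one object's tokens changes the outputs of the others, which is the cross-instance coupling the abstract describes.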
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Advanced Vision and Imaging · Music Technology and Sound Studies
Methods: Softmax · Attention Is All You Need · Diffusion
