3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation
Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang

TL;DR
3DIS is a two-stage, depth-driven framework for multi-instance text-to-image generation: it first generates a coarse scene depth map to position instances, then renders fine-grained attributes with a pre-trained ControlNet, without additional training for the rendering stage.
Contribution
3DIS proposes a decoupled approach that separates scene layout from attribute rendering. Because the rendering stage reuses a pre-trained ControlNet without additional training, it is compatible with different foundational models.
Findings
Outperforms existing methods in layout precision and attribute rendering.
Demonstrates robustness and adaptability across diverse foundational models.
Achieves significant improvements on COCO benchmarks.
Abstract
The increasing demand for controllable outputs in text-to-image generation has spurred advancements in multi-instance generation (MIG), allowing users to define both instance layouts and attributes. However, unlike image-conditional generation methods such as ControlNet, MIG techniques have not been widely adopted in state-of-the-art models like SD2 and SDXL, primarily due to the challenge of building robust renderers that simultaneously handle instance positioning and attribute rendering. In this paper, we introduce Depth-Driven Decoupled Instance Synthesis (3DIS), a novel framework that decouples the MIG process into two stages: (i) generating a coarse scene depth map for accurate instance positioning and scene composition, and (ii) rendering fine-grained attributes using pre-trained ControlNet on any foundational model, without additional training. Our 3DIS framework integrates a…
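The two-stage decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `Instance`, `toy_layout_to_depth`, and `synthesize` are hypothetical names, the constant-depth box painting is only a toy stand-in for the paper's depth-generation stage, and the `renderer` argument is a placeholder for a depth-conditioned renderer such as a pre-trained ControlNet. The sketch only shows the interface: stage 1 sees positions but not attributes, and stage 2 receives the depth map plus attribute descriptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, x1/y1 exclusive
DepthMap = List[List[float]]


@dataclass
class Instance:
    box: Box          # where the instance goes (consumed by stage 1)
    attributes: str   # fine-grained description (consumed only by stage 2)
    depth: float      # toy per-instance depth in (0, 1]; larger means nearer (assumption)


def toy_layout_to_depth(instances: List[Instance], height: int, width: int) -> DepthMap:
    """Toy stand-in for stage 1: paint each box with a constant depth.

    Where boxes overlap, the nearer (larger-depth) instance wins.
    The real method generates a coarse scene depth map with a model.
    """
    depth = [[0.0] * width for _ in range(height)]
    for inst in instances:
        x0, y0, x1, y1 = inst.box
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                depth[y][x] = max(depth[y][x], inst.depth)
    return depth


def synthesize(
    instances: List[Instance],
    height: int,
    width: int,
    renderer: Callable[[DepthMap, List[str]], object],
) -> object:
    """Run stage 1 (layout -> depth), then hand off to stage 2 (rendering)."""
    depth = toy_layout_to_depth(instances, height, width)
    # Stage 2 is injected: any depth-conditioned renderer can be plugged in here.
    return renderer(depth, [inst.attributes for inst in instances])
```

Because the renderer is passed in as a plain callable, swapping the underlying foundational model would only change the second stage, which is the property the abstract attributes to the decoupled design.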
Taxonomy
Topics: Human Motion and Animation · Computer Graphics and Visualization Techniques · Image Processing and 3D Reconstruction
Methods: Adapter
