Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
Jonghyun Lee, Hansam Cho, Youngjoon Yoo, Seoung Bum Kim, Yonghyun Jeong

TL;DR
This paper introduces a diffusion-based 3D-aware image synthesis method that localizes objects at different depths and combines multiple global styles using depth disentanglement and soft guidance techniques.
Contribution
It presents a novel framework, Compose and Conquer (CnC), integrating depth disentanglement and soft guidance for 3D-aware, multi-condition localized image synthesis.
Findings
Enables accurate 3D object placement in generated images
Allows compositional control of global semantics and object depth
Demonstrates versatility in synthesizing complex scenes
Abstract
Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three-dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques · Human Motion and Animation
Methods: Diffusion
