ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts
Uy Dieu Tran, Minh Luu, Phong Ha Nguyen, Khoi Nguyen, Binh-Son Hua

TL;DR
ModeDreamer introduces an image-guided score distillation loss (ISD) for text-to-3D generation, effectively guiding models toward specific modes, reducing over-smoothing, and improving output quality and stability.
Contribution
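The paper proposes ISD, a novel image prompt score distillation loss that uses a reference image to steer text-to-3D optimization toward a specific mode. ISD is implemented with IP-Adapter, a lightweight adapter that adds image prompt capability to a text-to-image diffusion model and serves here as a mode-selection module.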
Findings
Achieves high-quality, coherent 3D outputs.
Improves optimization speed over prior methods.
Enhances stability and mode control in text-to-3D synthesis.
Abstract
Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, 3D models produced by SDS-based methods tend to exhibit over-smoothing and low-quality outputs. These issues arise from the mode-seeking behavior of current methods, where the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented by using IP-Adapter, a lightweight adapter for integrating image prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not being prompted by a reference image, can serve as an efficient…
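To make the mode argument in the abstract concrete, here is a minimal toy sketch in one dimension. It is not the paper's ISD loss or its IP-Adapter setup. Everything in it is an assumption made for illustration: the two-mode Gaussian mixture stands in for a diffusion prior, `MODES`, `DATA_STD`, `eps_pred`, and `optimize` are hypothetical names, the time weighting is fixed to 1, and the "reference" is modeled simply by restricting the prior to a single mode. The update uses the standard SDS-style residual (predicted noise minus injected noise).

```python
import numpy as np

# Hypothetical toy prior: two equally weighted 1D Gaussian modes.
MODES = np.array([-2.0, 2.0])
DATA_STD = 0.3


def eps_pred(x_t, sigma, modes):
    """Exact noise prediction for an equal-weight Gaussian mixture.

    Uses eps_hat = -sigma * score(x_t), where score is the gradient of the
    log-density of the mixture after adding noise of standard deviation sigma.
    """
    var = DATA_STD**2 + sigma**2
    logits = -((x_t - modes) ** 2) / (2 * var)
    w = np.exp(logits - logits.max())  # numerically stable softmax weights
    w /= w.sum()
    score = np.sum(w * (modes - x_t)) / var
    return -sigma * score


def optimize(x0, modes, steps=500, lr=0.05, seed=0):
    """Run SDS-style updates on a scalar parameter x (toy stand-in for a 3D model)."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(steps):
        sigma = rng.uniform(0.2, 1.5)   # random noise level per step
        eps = rng.standard_normal()     # injected noise
        x_t = x + sigma * eps           # simplified forward noising
        grad = eps_pred(x_t, sigma, modes) - eps  # SDS residual with w(t) = 1
        x -= lr * grad
    return x


if __name__ == "__main__":
    # Unguided: the prior contains both modes, so the result depends on the noise seed.
    print("unguided:", [round(float(optimize(0.0, MODES, seed=s)), 2) for s in range(4)])
    # "Guided": the prior is restricted to the mode chosen by a hypothetical reference.
    print("guided:  ", [round(float(optimize(0.0, MODES[1:], seed=s)), 2) for s in range(4)])
```

In this toy, the unguided run starts at the symmetric point between the two modes, so the noise decides which mode it drifts toward, while the single-mode run is pulled consistently toward the selected mode. This mirrors the intuition stated in the abstract only loosely; the actual method conditions a diffusion model on a reference image rather than truncating the prior.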
Taxonomy
Topics: Human Motion and Animation · Video Analysis and Summarization · Handwritten Text Recognition Techniques
Methods: Adapter · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
