ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts

Uy Dieu Tran; Minh Luu; Phong Ha Nguyen; Khoi Nguyen; Binh-Son Hua

arXiv:2411.18135·cs.CV·March 4, 2025

Open Access

TL;DR

ModeDreamer introduces an image-guided score distillation loss (ISD) for text-to-3D generation, effectively guiding models toward specific modes, reducing over-smoothing, and improving output quality and stability.

Contribution

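The paper proposes ISD, a novel image prompt score distillation loss that uses a reference image to steer text-to-3D optimization toward a specific mode. ISD is implemented with IP-Adapter, a lightweight adapter that adds image prompt capability to a text-to-image diffusion model and here serves as the mode-selection module.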

Findings

01. Achieves high-quality, coherent 3D outputs.

02. Improves optimization speed over prior methods.

03. Enhances stability and mode control in text-to-3D synthesis.

Abstract

Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, 3D models produced by SDS-based methods tend to exhibit over-smoothing and low-quality outputs. These issues arise from the mode-seeking behavior of current methods, where the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented by using IP-Adapter, a lightweight adapter for integrating image prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not being prompted by a reference image, can serve as an efficient…
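
To make the mode-selection idea in the abstract concrete, here is a minimal toy sketch in one dimension. It is not the paper's ISD loss and does not use IP-Adapter or any diffusion model: the "prior" is an assumed two-mode Gaussian mixture, the function names (`noisy_score`, `distill`) are hypothetical, and the "reference" simply picks the nearest mode. It only illustrates how an SDS-style update of the form `eps_hat - eps` behaves with and without a signal that commits the score to a single mode.

```python
import math
import random

# Assumed toy prior: a 1D mixture with two equally likely modes.
MODES = (-2.0, 2.0)
MODE_STD = 0.5


def noisy_score(x_t, sigma, reference=None):
    """Score of the noised prior at x_t.

    Without a reference, the full mixture is used, so the score can point
    toward either mode. With a reference (hypothetical stand-in for an image
    prompt), only the mode nearest to it is used.
    """
    var = MODE_STD ** 2 + sigma ** 2
    if reference is not None:
        mu = min(MODES, key=lambda m: abs(m - reference))
        return (mu - x_t) / var
    # Posterior weight of each mode given x_t, computed stably in log space.
    log_w = [-0.5 * (x_t - m) ** 2 / var for m in MODES]
    top = max(log_w)
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return sum(wi / total * (m - x_t) for wi, m in zip(w, MODES)) / var


def distill(reference=None, steps=500, lr=0.05, seed=0):
    """Run an SDS-style update on a scalar 'parameter' x starting at 0."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        sigma = rng.uniform(0.2, 1.0)      # random noise level per step
        eps = rng.gauss(0.0, 1.0)          # injected noise
        x_t = x + sigma * eps              # noised sample
        eps_hat = -sigma * noisy_score(x_t, sigma, reference)
        x -= lr * (eps_hat - eps)          # SDS-style gradient step
    return x


if __name__ == "__main__":
    print("no reference:  ", distill())
    print("reference = +2:", distill(reference=2.0))
    print("reference = -2:", distill(reference=-2.0))
```

In this toy setting, the run without a reference settles near whichever mode the sampled noise happens to favor, while the runs with a reference settle near the selected mode regardless of the seed. The real method operates on rendered views of a 3D representation and a text-to-image diffusion model, which this sketch does not attempt to reproduce.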

Taxonomy

Topics: Human Motion and Animation · Video Analysis and Summarization · Handwritten Text Recognition Techniques

Methods: Adapter · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion