Flow Score Distillation for Diverse Text-to-3D Generation
Runjie Yan, Kailu Wu, Kaisheng Ma

TL;DR
This paper introduces Flow Score Distillation (FSD), a text-to-3D method that improves generation diversity by changing how noise is sampled during score distillation, building on a connection between Score Distillation Sampling (SDS) and DDIM.
Contribution
The paper reveals the connection between SDS and DDIM, and proposes a new noise sampling approach that significantly enhances diversity in text-to-3D generation.
Findings
FSD improves diversity without losing quality.
The noise sampling strategy is crucial for diversity.
FSD outperforms existing methods in experiments.
Abstract
Recent advancements in Text-to-3D generation have yielded remarkable progress, particularly through methods that rely on Score Distillation Sampling (SDS). While SDS exhibits the capability to create impressive 3D assets, it is hindered by its inherent maximum-likelihood-seeking essence, resulting in limited diversity in generation outcomes. In this paper, we discover that the Denoising Diffusion Implicit Models (DDIM) generation process (i.e., PF-ODE) can be succinctly expressed using an analogue of SDS loss. One step further, one can see SDS as a generalized DDIM generation process. Following this insight, we show that the noise sampling strategy in the noise addition stage significantly restricts the diversity of generation results. To address this limitation, we present an innovative noise sampling approach and introduce a novel text-to-3D method called Flow Score Distillation (FSD).…
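The contrast drawn in the abstract, between mode-seeking SDS updates and the deterministic DDIM process, can be illustrated on a toy problem. The sketch below is not the paper's FSD method. It assumes a 1-D Gaussian "data distribution" N(MU, S^2), an analytically exact noise predictor standing in for a trained diffusion network, a cosine schedule, and a unit weighting w(t) = 1; the names `schedule`, `noise_pred`, `sds_optimize`, and `ddim_sample` are hypothetical and chosen only for this illustration.

```python
import math
import random

MU, S = 2.0, 0.5   # toy 1-D "data distribution" N(MU, S^2); illustrative only
T_MAX = 0.99       # start slightly below t = 1 to avoid dividing by alpha = 0


def schedule(t):
    """Assumed cosine schedule: returns (alpha_t, sigma_t), alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def noise_pred(x_t, t):
    """Exact noise predictor for Gaussian data (stand-in for a trained network)."""
    a, b = schedule(t)
    return b * (x_t - a * MU) / (a * a * S * S + b * b)


def sds_optimize(x, rng, steps=5000, lr=0.01):
    """SDS-style updates: a fresh random timestep and fresh noise at every step."""
    for _ in range(steps):
        t = rng.uniform(0.02, 0.98)
        eps = rng.gauss(0.0, 1.0)
        a, b = schedule(t)
        x_t = a * x + b * eps                 # noise-addition stage
        x -= lr * (noise_pred(x_t, t) - eps)  # SDS gradient with w(t) = 1
    return x


def ddim_sample(eps, steps=200):
    """Deterministic DDIM: one fixed initial noise, t annealed from T_MAX to 0."""
    x_t = eps  # treat the initial noise as the sample at t = T_MAX
    for k in range(steps, 0, -1):
        t, t_prev = T_MAX * k / steps, T_MAX * (k - 1) / steps
        a, b = schedule(t)
        a_prev, b_prev = schedule(t_prev)
        e = noise_pred(x_t, t)
        x0_hat = (x_t - b * e) / a            # predicted clean sample
        x_t = a_prev * x0_hat + b_prev * e    # DDIM step with eta = 0
    return x_t


if __name__ == "__main__":
    rng = random.Random(0)
    print("SDS from -3:", sds_optimize(-3.0, rng))
    print("SDS from +7:", sds_optimize(7.0, rng))
    print("DDIM, eps=+1:", ddim_sample(1.0))
    print("DDIM, eps=-1:", ddim_sample(-1.0))
```

In this toy setting, the SDS-style runs end near the mode MU regardless of initialization, while the DDIM outputs stay about 2·S apart for initial noises of +1 and -1, so the choice of noise carries through to the result. How closely this mirrors the 3D setting is an assumption of the illustration, not a claim from the paper.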
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
S1. This paper aims to solve an important problem: the lack of diversity in 3D results generated by score distillation. S2. The proposed world-map noise function is interesting. S3. The results show varied 3D objects given the same prompt without performance degradation compared with SDS.
W1. Lack of novelty and originality. Existing papers have already discussed the connection between DDIM and SDS [NewRef-1, NewRef-2] and the use of a fixed noise for SDS [NewRef-2, NewRef-3]. In addition, the paper does not give enough rationale for how the proposed noise sampling resolves the issues with using a fixed noise. W2. Insufficient experiments. The paper lacks in-depth analysis of the proposed noise sampling technique. In addition, some results of previous methods show much diffe
- This paper formulates SDS as a generalized DDIM (Denoising Diffusion Implicit Models) process and introduces a world-map noise function for 3D generation; the noise mechanism design is simple and seems effective. - I like the quality of its generated 3D assets, which are sharp and come with fine-grained details; the quality of the results is consistent across various examples from the main paper and the supplementary material. - It also reveals the relationship between the initial noise map and the final 3D assets, which is insightful
- While I like the quality of the plotted 3D asset examples, my biggest concern is the significance and novelty of the contribution. All the strategies, such as leveraging multi-view diffusion models, scheduled noise level annealing, and the SDS formulation, have already been extensively explored in previous works starting from MV-Dream, ProlificDreamer, VSD. The innovation in noise map sampling is kind of weak. - I think the comparison to baselines is also not fair and informative enough, as the proposed method is
- The paper builds a connection between the PF-ODE and the SDS gradient to show that the SDS gradient is equivalent to some term in diffusion. - The paper proposes to use a consistent noise (which is only possible with its FSD formulation) to reduce multi-face problems. - The paper shows that the proposed method can improve consistency.
- Inadequate analysis to support the claim of quality and diversity improvement. The paper claims that the proposed method has better quality and diversity compared with previous methods. I appreciate the FID analysis experiment. However, it is not convincing. Firstly, the FID is computed between the rendered images and generated images, where the rendered images are from 16 (prompts) x 4 (seeds) = 64 3D objects (from my understanding). The number of evaluated 3D objects is relatively low. Seco
1. The paper is well-written and easy to understand. 2. The paper builds a connection between the SDS loss and the DDIM generation process, which is valuable and could inspire further research. 3. The proposed coarse-to-fine pipeline, though primarily an engineering solution, effectively contributes to generating high-quality 3D objects.
1. The concept of applying deterministic (fixed) noise to the SDS loss was introduced in previous work [1], but this prior work is neither cited nor discussed in the paper. My understanding is that the single-noise training in [1] is equivalent to the vanilla design described in Sec 4.1.1. 2. The proposed world-map noise function does not account for depth information -- noise is sampled without considering the shape of the generated object. Moreover, in the context of latent diffusion models,
Taxonomy
Topics: Human Motion and Animation · Computer Graphics and Visualization Techniques · Video Analysis and Summarization
Methods: Diffusion
