Taming Mode Collapse in Score Distillation for Text-to-3D Generation
Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra

TL;DR
This paper identifies mode collapse as the cause of view inconsistency in text-to-3D generation and proposes Entropic Score Distillation (ESD) to enhance diversity and reduce artifacts by reintroducing an entropy term.
Contribution
It introduces ESD, a novel method that re-establishes the entropy term in score distillation, effectively mitigating Janus artifacts in text-to-3D generation.
Findings
ESD effectively reduces Janus artifacts in experiments.
Maximizing entropy encourages view diversity.
The method is simple to implement using classifier-free guidance.
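The classifier-free guidance idea in the last finding can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `esd_noise_residual`, its arguments, and the exact weighting `lam * unconditioned + (1 - lam) * camera_conditioned` are assumptions made for this sketch. The page itself only states that guidance is applied to the camera condition of the learned score.

```python
def esd_noise_residual(eps_pretrained, eps_q_uncond, eps_q_cam, lam):
    """Hypothetical sketch of a CFG-style mix on the camera condition.

    eps_pretrained: noise predicted by the frozen text-to-image model
    eps_q_uncond:   noise predicted by the fine-tuned model WITHOUT camera pose
    eps_q_cam:      noise predicted by the fine-tuned model WITH camera pose
    lam:            mixing weight in [0, 1] (assumed form, not from the source)
    """
    # Blend the two predictions of the fine-tuned model.
    mixed = [lam * u + (1.0 - lam) * c for u, c in zip(eps_q_uncond, eps_q_cam)]
    # The residual is what would be backpropagated through the renderer.
    return [p - m for p, m in zip(eps_pretrained, mixed)]


# Toy usage with two-element "noise" vectors.
residual = esd_noise_residual([1.0, 2.0], [0.0, 0.0], [1.0, 1.0], lam=0.5)
```

Under this assumed form, `lam = 0` uses only the camera-conditioned prediction, and larger values of `lam` give more weight to the camera-unconditioned one.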
Abstract
Despite the remarkable performance of score distillation in text-to-3D generation, such techniques notoriously suffer from view inconsistency issues, also known as "Janus" artifact, where the generated objects fake each view with multiple front faces. Although empirically effective methods have approached this problem via score debiasing or prompt engineering, a more rigorous perspective to explain and tackle this problem remains elusive. In this paper, we reveal that the existing score distillation-based text-to-3D generation frameworks degenerate to maximal likelihood seeking on each view independently and thus suffer from the mode collapse problem, manifesting as the Janus artifact in practice. To tame mode collapse, we improve score distillation by re-establishing the entropy term in the corresponding variational objective, which is applied to the distribution of rendered images.…
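The role of the entropy term mentioned in the abstract can be seen from the standard decomposition of the KL divergence. The notation here is assumed for illustration: q stands for the distribution of rendered images and p for the target distribution given by the text-conditioned 2D prior.

```latex
\mathrm{KL}\left(q \,\|\, p\right)
  = \mathbb{E}_{x \sim q}\left[\log q(x)\right] - \mathbb{E}_{x \sim q}\left[\log p(x)\right]
  = -H[q] - \mathbb{E}_{x \sim q}\left[\log p(x)\right]
```

If the entropy H[q] is dropped, minimizing what remains only maximizes the expected log-likelihood under p, which is the likelihood-seeking behavior the abstract links to mode collapse. Keeping H[q] rewards spread in q.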
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
- The manuscript introduces an innovative approach that offers a fresh perspective on mitigating the Janus artifact in text-to-3D generation, setting it apart from prior works.
- The manuscript includes detailed mathematical derivations and demonstrations supporting the proposed method.
In the experimental section, the authors demonstrate the model's effectiveness by presenting both qualitative and quantitative outcomes of 3D generation across various prompts. Nevertheless, these findings may not comprehensively represent the overall scenario, particularly in cases where the model's performance exhibits instability. It would enhance the study if additional analyses or discussions were included. These analyses could focus on identifying the specific target distribution (whether
The strengths of this paper are as follows:
- The paper includes a detailed algorithm for implementing the proposed method, Entropic Score Distillation (ESD), which suggests a clear path for replication and verification. Also, I believe ESD adds no extra computational burden.
- The Janus problem is considered a main problem of text-to-3D generation with a 2D diffusion prior, and the authors discuss how to overcome it.
Although the method is easy to follow, I believe there is much room for improvement in the proposed method:
- Motivation: It is unclear how mode collapse is related to the Janus problem. Specifically, what is the precise definition of mode collapse in text-to-3D generation? In the general context, I believe the term mode collapse is used for the inability of a generative model to generate diverse outputs at the distribution level. However, it seems like the paper considers a problem in
1. This paper introduces classifier-free guidance for the camera condition in VSD's gradient update rule, providing a valuable tool for the text-to-3D field by demonstrating that this can be interpreted as a term considering entropy for the q distribution.
2. The motivation is intuitive, and the method is easy to implement.
The major concern is the lack of experiments and analysis. This paper only shows results for six prompts, counting all results in the main paper, appendix, and supplementary video. I think this could be insufficient to convince readers that the proposed method adequately addresses the Janus problem. Although additional quantitative results are provided, they are not metrics of the Janus problem or mode collapse, which this paper mainly addresses. I understand there's currently no proper metric that ful
1. The idea is simple and intuitive.
2. The math part seems not wrong to me.
3. Experiments show that ESD performs better than the baselines.
1. The main concern is about the evaluation. Since the results can be cherry-picked, could you report the success rate of generation, i.e., how many generated results do not have the Janus problem?
2. Missing baselines: Several methods for eliminating the Janus problem should be compared: [1] Hong, Susung, Donghoon Ahn, and Seungryong Kim. "Debiasing scores and prompts of 2d diffusion for robust text-to-3d generation." arXiv preprint arXiv:2303.15413 (2023). [2] Armandpour, Mohammadreza, et al. "
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis
