Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation
Yu Lei, Bingde Liu, Qingsong Xie, Haonan Lu, Zhijie Deng

TL;DR
This paper introduces $L^2$-VSD, a linearized lookahead variational score distillation method that improves text-to-3D generation quality by addressing convergence issues in existing score distillation techniques.
Contribution
It proposes a linearized lookahead approach to enhance variational score distillation, leading to more stable training and better 3D generation results.
Findings
$L^2$-VSD outperforms prior score distillation methods in generation quality.
The method can be implemented efficiently with existing autodiff tools.
It can be integrated into various VSD-based frameworks.
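To make the "linearized lookahead" and "existing autodiff tools" points concrete, here is a minimal sketch of a first-order (linearized) estimate of a model's output after a parameter update, computed with a single forward-mode Jacobian-vector product. Everything in it is an assumption for illustration: `Dual`, `score_model`, `linearized_lookahead`, and `exact_lookahead` are hypothetical names, and the two-parameter `tanh` model is only a stand-in for the LoRA score model, not the paper's implementation.

```python
import math


class Dual:
    """Dual number: a value plus a directional derivative (tangent)."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule carries the tangent forward.
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

    __rmul__ = __mul__


def tanh(z):
    """tanh on a dual number, propagating the tangent with the chain rule."""
    t = math.tanh(z.val)
    return Dual(t, (1.0 - t * t) * z.tan)


def score_model(phi, x):
    """Hypothetical stand-in for the score model: tanh(w * x + b)."""
    w, b = phi
    return tanh(w * x + b)


def linearized_lookahead(phi, delta, x):
    """First-order estimate of score_model(phi + delta, x).

    Seeds each parameter with its update as tangent, so one forward pass
    returns f(phi) and J(phi) @ delta together.
    """
    duals = [Dual(p, d) for p, d in zip(phi, delta)]
    out = score_model(duals, x)
    return out.val + out.tan


def exact_lookahead(phi, delta, x):
    """Reference value: actually apply the update, then evaluate."""
    stepped = [Dual(p + d) for p, d in zip(phi, delta)]
    return score_model(stepped, x).val
```

For a small update the two functions agree up to a second-order error in the update size. In a real framework the same Jacobian-vector product would more likely come from a library routine such as `jax.jvp` or `torch.func.jvp` than from a hand-written dual-number class; whether that matches the authors' code is not stated on this page.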
Abstract
Text-to-3D generation based on score distillation of pre-trained 2D diffusion models has gained increasing interest, with variational score distillation (VSD) as a remarkable example. VSD proves that vanilla score distillation can be improved by introducing an extra score-based model, which characterizes the distribution of images rendered from 3D models, to correct the distillation gradient. Despite the theoretical foundations, VSD, in practice, is likely to suffer from slow and sometimes ill-posed convergence. In this paper, we perform an in-depth investigation of the interplay between the introduced score model and the 3D model, and find that there exists a mismatching problem between LoRA and 3D distributions in practical implementation. We can simply adjust their optimization order to improve the generation quality. By doing so, the score model looks ahead to the current 3D state…
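For reference, the corrected distillation gradient that the abstract attributes to VSD is commonly written as below. This is recalled from the original VSD formulation rather than taken from this page, and the notation is an assumption: $\theta$ are the 3D parameters, $g(\theta, c)$ is the rendering under camera $c$, $x_t$ is that rendering noised to timestep $t$, $y$ is the text prompt, $\epsilon_{\text{pretrain}}$ is the frozen 2D diffusion model, $\epsilon_\phi$ is the extra (LoRA) score model, and $\omega(t)$ is a timestep weighting.

$$
\nabla_\theta \mathcal{L}_{\text{VSD}}(\theta) = \mathbb{E}_{t, \epsilon, c}\left[ \omega(t) \left( \epsilon_{\text{pretrain}}(x_t, t, y) - \epsilon_\phi(x_t, t, c, y) \right) \frac{\partial g(\theta, c)}{\partial \theta} \right]
$$

The subtracted term $\epsilon_\phi$ is the correction supplied by the extra score model; in vanilla score distillation the sampled noise $\epsilon$ takes its place.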
Peer Reviews
Decision · Submitted to ICLR 2025
+ The paper is well-structured. The writing is clear. The methodology section presents a detailed comparison between VSD and L-VSD, followed by a clear derivation of $L^2$-VSD.
+ The proposed $L^2$-VSD method addresses the mismatching problem in VSD by adjusting the optimization order and using a linearized variant for score distillation.
- The proposed $L^2$-VSD method seems to be highly dependent on the specific settings and assumptions of the VSD framework. It is not clear how well it would generalize to other text-to-3D generation methods such as Gaussian Dreamer or LucidDreamer, which are recent SOTA, or to different types of 3D representations such as Gaussian Splatting. The original VSD takes 8 hours to generate one 3D model, while Gaussian Dreamer takes only 15 minutes. The reviewer is afraid the proposed method does not give s…
1. Variational Score Distillation (VSD) is a representative score distillation method for diffusion-guided 3D generation. The defect analysis and improvement of VSD may provide inspiration for subsequent research. To the best of my knowledge, the analysis of LoRA training for VSD is original and somewhat interesting.
2. The paper is well-written and clearly structured.
1. This work only explores the potential issues of LoRA training for VSD, which limits the scope of this work. In fact, VSD is just one of the score distillation methods, and some newer SDS techniques such as ISM do not require a LoRA model. Even for VSD, the introduction of LoRA is already a compromise in implementation, and further analysis of its theory-implementation gaps seems trivial. I'm not opposed to this kind of exploration, but I expect it to bring more significant results than the ma
1. L2-VSD provides more stable convergence by utilizing a linearized lookahead correction.
2. L2-VSD can be incorporated into other VSD-based frameworks, such as HiFA.
The visual results presented for L2-VSD do not clearly demonstrate an improvement over those generated by VSD. Specifically, Figure 7 does not convincingly address known issues with VSD, such as saturated colors and visual artifacts in 3D assets. Additionally, the quantitative results in Table 1 are somewhat misleading. In most contexts, a higher CLIP similarity score indicates a better match with the prompt, yet L2-VSD, which has a lower CLIP similarity than other methods, is highlighted in bol
Taxonomy
Topics: Handwritten Text Recognition Techniques · Human Motion and Animation · Video Analysis and Summarization
