Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation

Yu Lei; Bingde Liu; Qingsong Xie; Haonan Lu; Zhijie Deng

arXiv:2507.09748 · cs.CV · July 15, 2025

TL;DR

This paper introduces $L^2$-VSD, a linearized lookahead variational score distillation method that improves text-to-3D generation quality by addressing convergence issues in existing score distillation techniques.

Contribution

It proposes a linearized lookahead approach to enhance variational score distillation, leading to more stable training and better 3D generation results.

Findings

1. $L^2$-VSD outperforms prior methods in generation quality.
2. The method can be implemented efficiently with existing autodiff tools.
3. It can be integrated into various VSD-based frameworks.

Abstract

Text-to-3D generation based on score distillation of pre-trained 2D diffusion models has gained increasing interest, with variational score distillation (VSD) as a remarkable example. VSD proves that vanilla score distillation can be improved by introducing an extra score-based model, which characterizes the distribution of images rendered from 3D models, to correct the distillation gradient. Despite the theoretical foundations, VSD, in practice, is likely to suffer from slow and sometimes ill-posed convergence. In this paper, we perform an in-depth investigation of the interplay between the introduced score model and the 3D model, and find that there exists a mismatching problem between LoRA and 3D distributions in practical implementation. We can simply adjust their optimization order to improve the generation quality. By doing so, the score model looks ahead to the current 3D state…
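The effect of the optimization order described above can be illustrated with a toy scalar sketch. This is not the paper's algorithm or code: the "3D model" is a single number `theta`, the "render" is `theta` itself, the pre-trained score is that of a unit-variance Gaussian centered at a hypothetical target `mu`, and the extra score model is a unit-variance Gaussian whose mean `phi` is fitted to the render. All names, step sizes, and initial values are assumptions chosen for illustration, and the linearization step is not modeled.

```python
def simulate(lookahead, steps=200, lr_theta=0.1, lr_phi=0.5,
             theta0=4.0, phi0=0.0, mu=1.0):
    """Toy alternating optimization of a '3D' parameter and a score model.

    lookahead=False: the score model used for the theta update was last
                     fitted to the previous render (it lags one step).
    lookahead=True:  the score model is fitted to the current render
                     first, then used for the theta update.
    Returns the final theta and the accumulated |phi - render| mismatch
    measured at the moment the score model is used.
    """
    theta, phi = theta0, phi0
    total_mismatch = 0.0
    for _ in range(steps):
        x = theta  # toy "render" of the current 3D state

        if lookahead:
            # Fit the score model to the current render before using it.
            phi += lr_phi * (x - phi)

        total_mismatch += abs(phi - x)

        # Distillation-style update direction for unit-variance Gaussians:
        # -(score_pretrained(x) - score_model(x)) = (x - mu) - (x - phi).
        grad = phi - mu
        new_theta = theta - lr_theta * grad

        if not lookahead:
            # Fit the score model to the same render only after it was used.
            phi += lr_phi * (x - phi)

        theta = new_theta
    return theta, total_mismatch


if __name__ == "__main__":
    for flag in (False, True):
        final_theta, mismatch = simulate(lookahead=flag)
        print(f"lookahead={flag}: theta={final_theta:.6f}, "
              f"accumulated mismatch={mismatch:.3f}")
```

In this toy setting both orders reach the target, but the accumulated mismatch between the score model and the render is smaller when the score model is updated first. Whether that carries over to real LoRA-based score models and 3D representations is exactly what the paper investigates, and the sketch should not be read as evidence either way.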

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

+ The paper is well-structured. The writing is clear. The methodology section presents a detailed comparison between VSD and L-VSD, followed by a clear derivation of $L^2$-VSD.
+ The proposed $L^2$-VSD method addresses the mismatching problem in VSD by adjusting the optimization order and using a linearized variant for score distillation.

Weaknesses

- The proposed $L^2$-VSD method seems to be highly dependent on the specific settings and assumptions of the VSD framework. It is not clear how well it would generalize to other text-to-3D generation methods like Gaussian Dreamer or LucidDreamer, which are recent SOTA, or to different types of 3D representations such as Gaussian Splatting. The original VSD takes 8 hours to generate one 3D model, while Gaussian Dreamer takes only 15 minutes. The reviewer is afraid the proposed method does not give s

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1. Variational Score Distillation (VSD) is a representative score distillation method for diffusion-guided 3D generation. The defect analysis and improvement of VSD may provide inspiration for subsequent research. To the best of my knowledge, the analysis of LoRA training for VSD is original and somewhat interesting.
2. The paper is well-written and clearly structured.

Weaknesses

1. This work only explores the potential issues of LoRA training for VSD, which limits the scope of this work. In fact, VSD is just one of the score distillation methods, and some newer SDS techniques such as ISM do not require a LoRA model. Even for VSD, the introduction of LoRA is already a compromise in implementation, and further analysis of its theory-implementation gaps seems trivial. I'm not opposed to this kind of exploration, but I expect it to bring more significant results than the ma

Reviewer 03 · Rating 3 · Confidence 4

Strengths

1. $L^2$-VSD provides more stable convergence by utilizing a linearized lookahead correction.
2. $L^2$-VSD can be incorporated into other VSD-based frameworks, such as HiFA.

Weaknesses

The visual results presented for $L^2$-VSD do not clearly demonstrate an improvement over those generated by VSD. Specifically, Figure 7 does not convincingly address known issues with VSD, such as saturated colors and visual artifacts in 3D assets. Additionally, the quantitative results in Table 1 are somewhat misleading. In most contexts, a higher CLIP similarity score indicates a better match with the prompt, yet $L^2$-VSD, which has a lower CLIP similarity than other methods, is highlighted in bol

Taxonomy

Topics: Handwritten Text Recognition Techniques · Human Motion and Animation · Video Analysis and Summarization