NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer
Meng You, Zhiyu Zhu, Hui Liu, Junhui Hou

TL;DR
NVS-Solver leverages pre-trained video diffusion models to perform zero-shot novel view synthesis from limited views without additional training, using adaptive modulation based on scene priors.
Contribution
It introduces a training-free, adaptive modulation approach for view synthesis using large video diffusion models, grounded in theoretical modeling of the diffusion process.
Findings
Outperforms state-of-the-art methods quantitatively and qualitatively
Effective on both static and dynamic scenes
Operates without training, solely relying on pre-trained models
Abstract
By harnessing the potent generative capabilities of pre-trained large video diffusion models, we propose NVS-Solver, a new novel view synthesis (NVS) paradigm that operates without the need for training. NVS-Solver adaptively modulates the diffusion sampling process with the given views to enable the creation of remarkable visual experiences from single or multiple views of static scenes or monocular videos of dynamic scenes. Specifically, built upon our theoretical modeling, we iteratively modulate the score function with the given scene priors represented with warped input views to control the video diffusion process. Moreover, by theoretically exploring the boundary of the estimation error, we achieve the modulation in an adaptive fashion according to the view pose and the number of diffusion steps. Extensive evaluations on both static and dynamic scenes substantiate the…
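The abstract describes the modulation only at a high level. As a rough illustration, and not the paper's actual formulation, the sketch below blends a denoiser's clean-sample estimate with a warped-view prior on pixels where the warp is valid. The weight shrinks as the target pose moves away from the input view and as the noise level grows. The function names, the exponential weighting, and the `pose_scale` parameter are all assumptions made for this example.

```python
import math


def modulation_weight(pose_distance, noise_level, pose_scale=1.0):
    """Hypothetical weight in (0, 1]: trust the warped prior less for
    far-away poses and at high noise levels. Not the paper's formula."""
    return math.exp(-pose_distance / pose_scale) / (1.0 + noise_level)


def modulate_estimate(x0_pred, warped, valid_mask, weight):
    """Blend the model's clean-sample estimate with the warped view.

    Pixels where the warp left a hole (mask == 0) keep the model estimate.
    All inputs are flat lists of equal length.
    """
    blended = []
    for pred, prior, valid in zip(x0_pred, warped, valid_mask):
        w = weight if valid else 0.0
        blended.append((1.0 - w) * pred + w * prior)
    return blended


# Toy 4-pixel "frame": the last pixel is occluded in the warped view.
x0_pred = [0.2, 0.4, 0.6, 0.8]
warped = [1.0, 1.0, 1.0, 0.0]
mask = [1, 1, 1, 0]

w = modulation_weight(pose_distance=0.0, noise_level=1.0)
blended = modulate_estimate(x0_pred, warped, mask, w)
```

In a real sampler this blend would be applied at every diffusion step and per frame, with the pose distance taken between each target frame and the nearest input view. Here it is reduced to one step on a flat list to keep the idea visible.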
Peer Reviews
Decision: ICLR 2025 Poster
* The idea of using depth-warped images as guidance for novel view synthesis is reasonable.
* It is interesting to see that a temporally consistent video diffusion model can be effectively reformulated to achieve geometrically consistent NVS in a training-free manner.
* Experiments on several challenging settings, including 360-degree NVS from a single view, verify the significance of the introduced method.
* Assessing geometric accuracy. For the 360-degree case, e.g., the truck, it would be better to apply mesh reconstruction to the rendered views, similar to Fig. 5(b) in latentSplat [Wewer et al. ECCV 2024]. The reconstructed mesh would give a clearer picture of how well the rendered views maintain correct geometry.
* Pixel-aligned metrics. For the NVS task, it would be better to report comparisons with state-of-the-art methods on pixel-aligned metrics, e.g., PSNR and SSIM.
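For context on the pixel-aligned metrics the reviewer asks for, PSNR is computed from the mean squared error between a rendered view and a ground-truth view of the same pose. The sketch below is a minimal standard-library version operating on flat lists of intensities in [0, `max_value`]. The function name and the list-based image representation are choices made for this example, and SSIM is not covered.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    given as flat lists of pixel intensities in [0, max_value]."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by 0.1 on a [0, 1] intensity scale.
score = psnr([0.0, 0.0, 0.5, 1.0], [0.1, 0.1, 0.6, 0.9])
```

Because the metric compares pixels at the same location, it is only meaningful when a ground-truth image exists for the exact target pose.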
1. The proposed approach is entirely training-free, meaning it directly leverages pre-trained large video diffusion models without requiring additional fine-tuning or retraining. This feature not only reduces computational demands but also makes it adaptable to a wide range of applications where time or resources for training may be limited. The flexibility of using pre-trained models enhances its practicality, allowing users to apply this method to various scenes and tasks with minimal setup.
1. The comparison between this method and NeRF-based methods is fundamentally imbalanced. NeRF techniques incorporate an underlying 3D structure, enabling them to render any view with predictable performance, as the 3D structure informs which views are feasible and which are not. In contrast, the proposed method lacks an explicit 3D representation, limiting its view synthesis capabilities to specific views with no guarantee of consistent performance. This distinction is significant, as NeRF's in
1. The proposed adaptive modulation of the score function in the diffusion process is novel.
2. The proposed method achieves better results in various scenarios compared to baselines.
3. The authors provide the code with an anonymous link, ensuring the applicability of the results.
My primary concerns are with the references and experimental details: 1. Some key references on diffusion-based NVS are missing [1,2,3,4,5,6,7]. Among these, [3] specifically focuses on scenes and has released its code. Is there a particular reason it was not included in the comparison? 2. How is the synthesized view pose calculated in this paper? In Line 364, it states that 'current depth estimation algorithms struggle to derive absolute depth from a single view or monocular video, resulting i
Taxonomy
Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Computer Graphics and Visualization Techniques
Methods: Diffusion
