Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-view Synthesis from Monocular Videos
Kaihua Chen, Tarasha Khurana, Deva Ramanan

TL;DR
This paper introduces an approach to novel-view synthesis for dynamic scenes from monocular videos. It combines dynamic 3D reconstruction, inpainting with a video diffusion model, and test-time finetuning, and the authors report that it outperforms almost all prior art.
Contribution
It proposes a method that integrates dynamic 3D scene reconstruction, a self-supervised video inpainting diffusion model (CogNVS), and test-time finetuning, which lets CogNVS be applied zero-shot to novel test videos.
Findings
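Reported to outperform almost all prior art for dynamic-scene novel-view synthesis from monocular videos
CogNVS, the video inpainting diffusion model, is self-supervised from 2D videos, so it can be trained on a large corpus of in-the-wild videos
CogNVS can be applied zero-shot to novel test videos via test-time finetuning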
Abstract
We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (that are visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and rendering the reconstruction from the novel-views and (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows for (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for…
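The split between covisible and hidden pixels described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a pinhole camera with a single focal length, a novel view that differs only by a sideways translation, and no z-buffering, and the helper names (`unproject`, `covisible_mask`, `composite`) are hypothetical. It lifts a depth map to 3D points, reprojects them into the shifted camera to mark which target pixels are covisible, and fills the remaining (hidden) pixels from a stand-in for the inpainting model's output.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject(depth: List[List[float]], focal: float) -> List[Point]:
    """Lift every pixel of a depth map to a 3D point in the source camera frame."""
    h, w = len(depth), len(depth[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0  # assumed principal point: image center
    points = []
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            points.append(((u - cx) * z / focal, (v - cy) * z / focal, z))
    return points


def covisible_mask(points: List[Point], focal: float, width: int,
                   height: int, shift_x: float) -> List[List[bool]]:
    """Project points into a camera translated by shift_x along x (no rotation).

    True marks a target pixel hit by at least one reconstructed point
    (covisible); False marks a hidden pixel that would need inpainting.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    mask = [[False] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # point is behind the camera
        u = round(focal * (x - shift_x) / z + cx)
        v = round(focal * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            mask[v][u] = True
    return mask


def composite(rendered, inpainted, mask):
    """Keep rendered values where covisible, inpainted values elsewhere."""
    return [[r if m else i for r, i, m in zip(r_row, i_row, m_row)]
            for r_row, i_row, m_row in zip(rendered, inpainted, mask)]


# Toy scene: a flat 4x4 wall at depth 2, viewed with focal length 2.
depth = [[2.0] * 4 for _ in range(4)]
points = unproject(depth, focal=2.0)
mask = covisible_mask(points, focal=2.0, width=4, height=4, shift_x=1.0)
```

In this toy setup, moving the camera one unit to the right shifts the wall one pixel to the left, so the rightmost column of the target view has no source point behind it and is left for the inpainting stage. A real pipeline would also handle camera rotation, occlusion ordering, and per-frame scene motion.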
Taxonomy
Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies
