TL;DR
This paper introduces FrameCrafter, a method that adapts video diffusion models for sparse novel view synthesis (NVS) by treating it as a permutation-invariant video completion task, achieving competitive results on sparse-view NVS benchmarks.
Contribution
The paper proposes architectural modifications to adapt video models for permutation-invariant sparse NVS, enabling effective view synthesis from few unordered images.
Findings
Video models contain implicit multi-view knowledge.
FrameCrafter achieves competitive performance on sparse-view NVS benchmarks.
Architectural changes enable models to forget temporal order with minimal supervision.
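To make the notion of permutation invariance concrete, here is a minimal toy sketch in Python. It is not FrameCrafter's architecture, which this summary does not describe; the function names (`encode_view`, `aggregate_mean`, `aggregate_positional`, `is_permutation_invariant`) and the feature and pose vectors are hypothetical, chosen only to illustrate the property. A mean over per-view encodings ignores input order, while an aggregation that weights each view by its position, loosely analogous to a model that relies on frame order, does not.

```python
import itertools
import math

def encode_view(feature, pose):
    """Toy per-view encoder (hypothetical): add the pose vector to the feature vector."""
    return [f + p for f, p in zip(feature, pose)]

def aggregate_mean(views):
    """Order-independent aggregation: element-wise mean over all encoded views."""
    encoded = [encode_view(feature, pose) for feature, pose in views]
    return [sum(column) / len(encoded) for column in zip(*encoded)]

def aggregate_positional(views):
    """Order-dependent aggregation: each view is weighted by its 1-based position."""
    encoded = [encode_view(feature, pose) for feature, pose in views]
    return [
        sum((index + 1) * value for index, value in enumerate(column))
        for column in zip(*encoded)
    ]

def is_permutation_invariant(aggregate, views):
    """Check that `aggregate` returns the same output for every ordering of `views`."""
    reference = aggregate(views)
    for permuted in itertools.permutations(views):
        output = aggregate(list(permuted))
        if not all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(reference, output)):
            return False
    return True

# Three toy (feature, pose) pairs standing in for unordered input views.
views = [
    ([1.0, 2.0], [0.0, 1.0]),
    ([3.0, 0.0], [1.0, 0.0]),
    ([5.0, 4.0], [0.5, 0.5]),
]

mean_is_invariant = is_permutation_invariant(aggregate_mean, views)
positional_is_invariant = is_permutation_invariant(aggregate_positional, views)
```

In this sketch `mean_is_invariant` is true and `positional_is_invariant` is false, which is the distinction the findings point at: a model adapted for sparse NVS should behave like the first aggregation with respect to the order of its input views.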
Abstract
We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models: given a few multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. One challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be invariant to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame…
