KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos
Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, Noah Snavely

TL;DR
This paper introduces KFC-W, a self-supervised model that generates 3D-consistent videos from unposed internet photos, capturing scene geometry without needing 3D annotations.
Contribution
We propose a scalable, 3D-aware video generation method trained solely on 2D internet photos and videos, with no 3D annotations such as camera parameters. It outperforms the baselines in geometric and appearance consistency.
Findings
Our model outperforms baselines in geometric and appearance consistency.
It benefits applications that enable camera control, such as 3D Gaussian Splatting.
It shows potential for scaling 3D scene learning using only 2D data.
Abstract
We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that…
