Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo

TL;DR
InfCam introduces a novel depth-free video generation framework that uses infinite homography warping and data augmentation to achieve high-fidelity, camera-controlled video synthesis with diverse trajectories, outperforming existing methods.
Contribution
The paper proposes InfCam, a depth-free, camera-controlled video generation method that encodes 3D rotations in 2D latent space and enhances data diversity, improving pose fidelity and visual quality.
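For readers unfamiliar with the term, the infinite homography is the standard multi-view geometry mapping H_inf = K_tgt R K_src^{-1}, which relates pixels in two views for points at infinity (or under pure camera rotation) without needing depth. The sketch below illustrates only that textbook definition in NumPy; it is not InfCam's implementation, and the function names, the intrinsics values, and the convention that R maps source-camera coordinates to target-camera coordinates are assumptions made for illustration.

```python
import numpy as np

def infinite_homography(K_src, K_tgt, R):
    """Textbook infinite homography H_inf = K_tgt @ R @ inv(K_src).

    Assumed convention: R maps source-camera coordinates to
    target-camera coordinates (X_tgt = R @ X_src).
    """
    return K_tgt @ R @ np.linalg.inv(K_src)

def warp_points(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of pixel coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]                   # back to pixel coords

def yaw_rotation(theta):
    """Rotation about the camera's y-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

# Hypothetical pinhole intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

H = infinite_homography(K, K, yaw_rotation(np.deg2rad(10.0)))
warped = warp_points(H, np.array([[320.0, 240.0]]))  # where the image center lands
```

Because the mapping depends only on the intrinsics and the rotation, it involves no depth estimate, which is the property the summary's "depth-free" wording refers to.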
Findings
Outperforms baseline methods in camera-pose accuracy
Generalizes well from synthetic to real-world data
Achieves high visual fidelity in generated videos
Abstract
Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on a dataset of trajectory-video pairs, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth…
Taxonomy
Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
