CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, Arash Vahdat

TL;DR
CamCo introduces a novel camera-controllable image-to-video generation method that enhances 3D consistency and user control over camera poses by integrating epipolar constraints and fine-tuning on real-world videos.
Contribution
We propose CamCo, a new framework that enables precise camera pose control and improved 3D consistency in image-to-video generation using epipolar attention and real-world fine-tuning.
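As a rough illustration of what an epipolar constraint between two frames looks like, the sketch below builds a fundamental matrix from a relative camera pose and derives a boolean mask that keeps only target pixels lying near each source pixel's epipolar line. This is not CamCo's module: the function names, the pose convention (`x2_cam = R_rel @ x1_cam + t_rel`), the shared intrinsics `K`, and the pixel-distance `threshold` are all assumptions made for this example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K, R_rel, t_rel):
    """F = K^-T [t]_x R K^-1, assuming x2_cam = R_rel @ x1_cam + t_rel
    and the same intrinsics K for both frames (an assumption)."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(np.asarray(t_rel, dtype=float)) @ R_rel @ K_inv

def epipolar_mask(F, height, width, threshold=1.0):
    """Hypothetical helper: mask[i, j] is True when target pixel j lies
    within `threshold` pixels of the epipolar line of source pixel i."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u.ravel(), v.ravel(), np.ones(height * width)], axis=-1)
    lines = pix @ F.T                      # row i is the line l_i = F @ x_i
    num = np.abs(lines @ pix.T)            # |l_i . x_j| for every pixel pair
    denom = np.linalg.norm(lines[:, :2], axis=-1, keepdims=True)
    return num / np.maximum(denom, 1e-8) <= threshold

# Example: a sideways camera translation gives horizontal epipolar lines.
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
F = fundamental_matrix(K, np.eye(3), [1.0, 0.0, 0.0])
mask = epipolar_mask(F, height=4, width=4, threshold=0.5)
```

A mask like this could, for instance, be used to restrict which feature locations attend to each other across frames; how the paper's attention block actually applies the constraint is not specified in this summary.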
Findings
Significantly improves 3D consistency in generated videos.
Enables fine-grained camera pose control.
Effectively synthesizes plausible object motion.
Abstract
Recently, video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, limiting the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation. We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block that enforces epipolar constraints to the feature maps. Additionally, we fine-tune CamCo on real-world videos with camera poses estimated through structure-from-motion algorithms to better synthesize object motion. Our…
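To make the Plücker parameterization concrete, the sketch below computes a per-pixel 6-channel ray embedding (moment `o × d` followed by unit direction `d`) from a pinhole camera. It is a minimal example under stated assumptions, not the paper's implementation: the function name `plucker_embedding`, the world-to-camera convention (`x_cam = R @ x_world + t`), the pixel-center offset of 0.5, and the channel ordering are all choices made here for illustration.

```python
import numpy as np

def plucker_embedding(K, R, t, height, width):
    """Return an (H, W, 6) array of Plücker ray coordinates (o x d, d).

    Assumes K is a 3x3 pinhole intrinsics matrix and (R, t) map world
    points to camera coordinates: x_cam = R @ x_world + t.
    """
    K = np.asarray(K, dtype=float)
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)

    # Camera center in world coordinates.
    o = -R.T @ t

    # Homogeneous pixel centers, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-frame directions, then rotate into the world.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ R              # row-vector form of R.T @ d
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Moment of each ray about the world origin.
    moment = np.cross(o, dirs_world)
    return np.concatenate([moment, dirs_world], axis=-1)

# Example with hypothetical camera parameters.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
emb = plucker_embedding(K, np.eye(3), [0.5, -0.2, 1.0], height=48, width=64)
```

Unlike a raw rotation matrix and translation vector, this embedding has the same spatial layout as the image, which is one reason it is convenient as a conditioning input to a convolutional or attention-based generator.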
Peer Reviews
Decision·Submitted to ICLR 2025
**Good presentation** The paper is easy to read and has good figures illustrating the method. Overall, the pipeline is reasonable and each module is clearly introduced. **Good experimental results** The model shows superior results over existing camera-controllable image-to-video generation methods, with much lower COLMAP error and FVD. **Good generalization performance.** I appreciate that the authors conduct rich experiments on unseen data from multiple sources to show the generalization ability. The auth
**Very Limited Contribution** The two fundamental modules introduced in this paper, Plücker embedding and epipolar attention, are both commonly used techniques. For Plücker embedding, previous work in CameraCtrl (He et al., 2024) also used the same technique. Even though it was used for text-to-video generation rather than image-to-video, the key technique is the same: how to better incorporate the camera pose into the video generation model rather than using R and t. Also works in 3D generati
1. The authors implement a data curation pipeline that annotates in-the-wild videos with estimated camera poses using structure-from-motion algorithms. This enhances the model's ability to generate plausible object motion in addition to camera movements, addressing the challenge of synthesizing dynamic scenes. 2. The paper provides thorough quantitative and qualitative evaluations, demonstrating that CamCo outperforms baseline methods in terms of visual quality, camera controllability, and geome
1. The biggest weakness of the paper is its technical contribution. Its main designs are Plücker coordinates and epipolar attention, but neither of these is exactly novel, even in the constrained domain of camera-controllable video generation: the former was used in CameraCtrl [1] and the latter was used in Collaborative Video Diffusion [2]. The authors should discuss what is novel about their approach while using these techniques. 2. Without sufficient dynamic training data, the model tends t
The proposed method is explained clearly. The authors compare with a few baselines and show the best adherence to the input camera according to Table 1 (although the most important one is missing). The curated dynamic video dataset can be a good contribution if released.
1. Despite the effort of annotating the dynamic dataset WebVid, the results on the supplementary page show that the foreground dynamics are still largely lost. Even the eagle example doesn't show prominent motion; in another example, where a bird is flying above a lake, the proposed method does produce more object translation than the baselines, but the object size is fairly small. Arguably, the proposed method still suffers from the common problem shared with the state-of-the-art camera-conditioned methods, i.e., l
Taxonomy
Topics: Advanced Vision and Imaging · Medical Image Segmentation Techniques · Image Processing Techniques and Applications
Methods: Diffusion
