CamI2V: Camera-Controlled Image-to-Video Diffusion Model
Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu, Xi Li

TL;DR
This paper introduces CamI2V, a camera-controlled image-to-video diffusion model that uses epipolar attention, together with a more robust evaluation, to improve geometry consistency and camera controllability. It reports a 25.64% improvement in camera controllability on RealEstate10K.
Contribution
It proposes a new method for modeling noisy cross-frame interactions with epipolar attention and robust evaluation, enhancing camera controllability in video diffusion models.
Findings
25.64% improvement in camera controllability on RealEstate10K
Robust performance in dynamic and occluded scenarios
Efficient training and inference with limited memory
Abstract
Recent advancements have integrated camera pose as a user-friendly and physics-informed condition in video diffusion models, enabling precise camera control. In this paper, we identify one of the key challenges as effectively modeling noisy cross-frame interactions to enhance geometry consistency and camera controllability. We innovatively associate the quality of a condition with its ability to reduce uncertainty and interpret noisy cross-frame features as a form of noisy condition. Recognizing that noisy conditions provide deterministic information while also introducing randomness and potential misguidance due to added noise, we propose applying epipolar attention to only aggregate features along corresponding epipolar lines, thereby accessing an optimal amount of noisy conditions. Additionally, we address scenarios where epipolar lines disappear, commonly caused by rapid camera…
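To make the idea of aggregating features only along epipolar lines concrete, the following is a minimal sketch of how a per-pixel epipolar mask could be computed from a relative camera pose. It is not the authors' implementation. The function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the pose convention `X_target = R @ X_source + t`, and the pixel-distance `threshold` are all assumptions made for illustration.

```python
import math

def cross_matrix(t):
    """Skew-symmetric matrix [t]_x such that [t]_x @ v == cross(t, v)."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(a, x):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * x[k] for k in range(3)) for i in range(3)]

def epipolar_line(u, v, R, t, fx, fy, cx, cy):
    """Epipolar line (a, b, c) in target pixel coordinates for source pixel (u, v).

    Assumed convention: X_target = R @ X_source + t, pinhole intrinsics
    shared by both frames. Target pixels (qu, qv) on the line satisfy
    a*qu + b*qv + c == 0.
    """
    x = [(u - cx) / fx, (v - cy) / fy, 1.0]   # source pixel in normalized coordinates
    E = matmul3(cross_matrix(t), R)           # essential matrix E = [t]_x R
    a, b, c = matvec3(E, x)                   # line in normalized target coordinates
    # Map the line to pixel coordinates with K^{-T}.
    return (a / fx, b / fy, c - a * cx / fx - b * cy / fy)

def epipolar_mask(u, v, R, t, fx, fy, cx, cy, width, height, threshold=0.5):
    """Boolean mask[row][col]: True where a target pixel lies near the epipolar line."""
    a, b, c = epipolar_line(u, v, R, t, fx, fy, cx, cy)
    norm = math.hypot(a, b)
    if norm < 1e-12:
        # Degenerate case (e.g. zero translation): no epipolar line is defined.
        return [[False] * width for _ in range(height)]
    return [[abs(a * qu + b * qv + c) / norm <= threshold for qu in range(width)]
            for qv in range(height)]

# Usage: a pure sideways translation maps each source pixel to its own image row.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask = epipolar_mask(1, 2, identity, (1.0, 0.0, 0.0), 2.0, 2.0, 1.5, 1.5, 4, 4)
```

In an attention layer, such a mask would restrict each query pixel to keys near its epipolar line in the other frame. When the line misses the image entirely, or is undefined, the mask is empty.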
Peer Reviews
Decision: Submitted to ICLR 2025
The paper is well written with a few nicely created figures, e.g., Fig. 1 and Fig. 2. The ideas of (1) clean vs. noisy condition & (2) register tokens are neat (but they are also related to the weakness and questions below). Despite being tuned on a static/rigid scene dataset (RealEstate10K), the few generated videos in the supplementary suggest that the foreground motion dynamics are not lost too much. The discussion in L457-479 is good and clearly supports the design choices. For fair comparison, the aut
* Contribution is clearly articulated in L.135-139 but not verified in the experiment: using Plücker embedding and/or epipolar attention has become a norm in the 3D generation literature, e.g. [1]. Applying one of them, if not both, in the video generation field has also been done, e.g., CameraCtrl, CamCo etc. Therefore, the biggest technical contribution I see in this work, to my knowledge, is applying the idea of register tokens to account for occlusion, zero epipolar scenarios, etc. Despite
+ The epi-polar mask attention layer proposed in the paper helps to enhance the camera control ability for video diffusion models. Also, it is plug-and-play – we don't need to retrain other modules of the original pretrained VDM.
+ It is an interesting idea to include the register token to handle cases where the epipolar constraint on correspondences fails, although more discussion and evaluation on it are needed (see the weakness section).
+ The paper addresses the inaccuracy in the SfM for r
+ Discussion and experiment missing for a key statement in the paper: While the paper states (in L112) that register tokens are included to handle rapid camera movements, occlusions, and dynamic objects, this contribution (also the key difference from CamCo) is not discussed in more detail. For example, how does this additional token help to deal with the non-epipolar constrained correspondences? Can the image-level register token handle pixel-level dense correspondences across frames (like movi
The paper is easy to follow, and the figures are intuitive and look good. Experiments show that the model outperforms all available state-of-the-art works.
Novelty of the paper: The contribution of this paper closely resembles CamCo, which was released in June (3.5 months prior to the submission deadline). Both methods use Plücker embeddings and epipolar lines, and both target image-to-video models. The paper mentions that CamCo does not support video trajectories with non-overlapping frames, and that the introduction of register tokens alleviates this problem. Yet this can be considered a relatively minor improvement, and is not well supported by experim
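The reviews above ask how a register token helps when no epipolar correspondence exists. The excerpt does not describe the paper's mechanism, so the following is only one plausible reading, sketched under stated assumptions: a single extra key that is always visible is appended to the masked attention, so that a query whose epipolar mask is empty still has a well-defined softmax. The function name and the argument layout are hypothetical.

```python
import math

def masked_attention_with_register(scores, mask, register_scores):
    """Softmax attention weights with one always-visible register key.

    scores:          [n_queries][n_keys] attention logits
    mask:            [n_queries][n_keys] booleans, True where the key is allowed
    register_scores: [n_queries] logits for the (hypothetical) register token
    Returns [n_queries][n_keys + 1] weights; the last column is the register token.
    """
    weights = []
    for row_scores, row_mask, reg in zip(scores, mask, register_scores):
        # Masked-out keys get -inf so they receive exactly zero weight.
        logits = [s if m else -math.inf for s, m in zip(row_scores, row_mask)]
        logits.append(reg)                      # the register key is never masked
        peak = max(logits)                      # finite, because reg is finite
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Usage: query 0 sees both keys; query 1 has an empty epipolar mask.
weights = masked_attention_with_register(
    scores=[[0.0, 0.0], [1.0, 2.0]],
    mask=[[True, True], [False, False]],
    register_scores=[0.0, 0.0],
)
```

Without the extra key, the second query would have every logit at negative infinity and its softmax would be undefined. With it, all of that query's weight falls on the register token. Whether an image-level token of this kind can handle pixel-level dense correspondences is the open question the reviewers raise.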
Taxonomy
Topics: Advanced Vision and Imaging · Infrared Target Detection Methodologies · Advanced Measurement and Detection Methods
Methods: Softmax · Attention Is All You Need · Diffusion
