CamI2V: Camera-Controlled Image-to-Video Diffusion Model

Guangcong Zheng; Teng Li; Rui Jiang; Yehao Lu; Tao Wu; Xi Li

arXiv:2410.15957 · cs.CV · December 5, 2024 · 2 cites

TL;DR

This paper introduces CamI2V, a camera-controlled image-to-video diffusion model that uses epipolar attention, paired with robust evaluation, to improve geometry consistency and camera controllability. It reports a 25.64% improvement in camera controllability on RealEstate10K.

Contribution

The paper proposes modeling noisy cross-frame interactions with epipolar attention, paired with robust evaluation, to enhance camera controllability in video diffusion models.

Findings

01. 25.64% improvement in camera controllability on RealEstate10K
02. Robust performance in dynamic and occluded scenarios
03. Efficient training and inference with limited memory

Abstract

Recent advancements have integrated camera pose as a user-friendly and physics-informed condition in video diffusion models, enabling precise camera control. In this paper, we identify one of the key challenges as effectively modeling noisy cross-frame interactions to enhance geometry consistency and camera controllability. We innovatively associate the quality of a condition with its ability to reduce uncertainty and interpret noisy cross-frame features as a form of noisy condition. Recognizing that noisy conditions provide deterministic information while also introducing randomness and potential misguidance due to added noise, we propose applying epipolar attention to only aggregate features along corresponding epipolar lines, thereby accessing an optimal amount of noisy conditions. Additionally, we address scenarios where epipolar lines disappear, commonly caused by rapid camera…
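The epipolar restriction described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. It assumes a pinhole camera with known intrinsics, a relative pose (R, t) that maps points from the source camera to the target camera, and a pixel-distance threshold; the function names (`fundamental`, `epipolar_mask`) and the always-enabled fallback column are hypothetical choices made for illustration. The fallback column stands in for the register tokens the reviewers mention: it guarantees every query has at least one valid key even when no pixel lies near its epipolar line.

```python
import math

def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose3(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def inv_intrinsics(fx, fy, cx, cy):
    """Inverse of the pinhole matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]

def fundamental(k_inv, rot, trans):
    """F = K^-T [t]_x R K^-1 for X_target = R X_source + t (shared K assumed)."""
    essential = matmul3(skew(trans), rot)
    return matmul3(transpose3(k_inv), matmul3(essential, k_inv))

def epipolar_mask(f_mat, height, width, threshold):
    """Boolean attention mask of shape (H*W) x (H*W + 1).

    mask[q][k] is True when key pixel k in the target frame lies within
    `threshold` pixels of the epipolar line of query pixel q in the source
    frame. The extra last column is a fallback slot (an assumption of this
    sketch) that is always enabled.
    """
    # Pixel centers in row-major order.
    pixels = [(u + 0.5, v + 0.5) for v in range(height) for u in range(width)]
    mask = []
    for uq, vq in pixels:
        a, b, c = matvec3(f_mat, [uq, vq, 1.0])  # line a*u + b*v + c = 0
        norm = math.hypot(a, b)
        if norm < 1e-12:
            # Degenerate line (for example zero translation): no key qualifies.
            row = [False] * len(pixels)
        else:
            row = [abs(a * uk + b * vk + c) / norm <= threshold
                   for uk, vk in pixels]
        row.append(True)  # fallback slot keeps the softmax well defined
        mask.append(row)
    return mask
```

For a pure sideways translation with identity rotation, the epipolar lines are horizontal, so each query pixel is allowed to attend only to keys in its own image row plus the fallback slot. In an attention layer, entries that are False would typically be set to negative infinity before the softmax.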

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 5

Strengths

The paper is well written, with a few nicely created figures, e.g., Fig. 1 and Fig. 2. The ideas of (1) clean vs. noisy conditions and (2) register tokens are neat (but they are also related to the weaknesses and questions below). Despite being tuned on a static/rigid scene dataset (RealEstate10K), the few generated videos in the supplementary suggest that the foreground motion dynamics are not lost too much. The discussion in L457-479 is good and clearly supports the design choices. For fair comparison, the aut

Weaknesses

* The contribution is clearly articulated in L135-139 but not verified in the experiments: using Plücker embedding and/or epipolar attention has become the norm in the 3D generation literature, e.g. [1]. Applying one of them, if not both, in the video generation field has also been done, e.g., CameraCtrl, CamCo, etc. Therefore, the biggest technical contribution I see in this work, to my knowledge, is applying the idea of register tokens to account for occlusion, zero epipolar scenarios, etc. Despite

Reviewer 02 · Rating 6 · Confidence 4

Strengths

+ The epipolar mask attention layer proposed in the paper helps to enhance the camera control ability of video diffusion models. It is also plug-and-play: we don't need to retrain other modules of the original pretrained VDM.
+ It is an interesting idea to include the register token to handle cases where the epipolar constraint on correspondences fails, although more discussion and evaluation of it are needed (see the weakness section).
+ The paper addresses the inaccuracy in the SfM for r

Weaknesses

+ Discussion and experiments are missing for a key statement in the paper: while the paper states (in L112) that register tokens are included to handle rapid camera movements, occlusions, and dynamic objects, this contribution (also the key difference from CamCo) is not discussed in more detail. For example, how does this additional token help to deal with the non-epipolar constrained correspondences? Can the image-level register token handle pixel-level dense correspondences across frames (like movi

Reviewer 03 · Rating 5 · Confidence 5

Strengths

The paper is easy to follow; the figures are intuitive and look good. Experiments show the model outperforms all available state-of-the-art works.

Weaknesses

Novelty of the paper: The contribution of this paper highly resembles CamCo, which was released in June (3.5 months prior to the submission deadline). Both methods use Plücker embedding and epipolar lines, and both target image-to-video models. The paper mentions that CamCo does not support video trajectories with non-overlapping frames, and that the introduction of register tokens alleviates this problem. Yet this can be considered a relatively minor improvement, and is not well supported by experim

Code & Models

Repositories

ZGCTroy/CamI2V
PyTorch · Official

Models

🤗 MuteApo/CamI2V
model · ♡ 1

Datasets

MuteApo/RealCam-Vid
dataset · 3.6k dl

Taxonomy

Topics: Advanced Vision and Imaging · Infrared Target Detection Methodologies · Advanced Measurement and Detection Methods

Methods: Softmax · Attention Is All You Need · Diffusion