VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Sixiao Zheng; Zimian Peng; Yanpeng Zhou; Yi Zhu; Hang Xu; Xiangru Huang; Yanwei Fu

arXiv:2502.07531 · cs.CV · September 29, 2025

Open Access

TL;DR

VidCRAFT3 introduces a unified framework for image-to-video generation that allows precise control over camera, object, and lighting, overcoming previous limitations of separate control signals and dataset scarcity.

Contribution

VidCRAFT3 combines three core components with a new dataset to enable joint control of camera motion, object motion, and lighting direction in image-to-video generation.

Findings

01. Outperforms existing methods in control accuracy

02. Achieves higher visual coherence in generated videos

03. Demonstrates robustness with limited joint annotations

Abstract

Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. In content creation workflows, precise and simultaneous control over camera motion, object motion, and lighting direction enhances both accuracy and flexibility. However, existing approaches typically treat these control signals separately, largely due to the scarcity of datasets with high-quality joint annotations and mismatched control spaces across modalities. We present VidCRAFT3, a unified and flexible I2V framework that supports both independent and joint control over camera motion, object motion, and lighting direction by integrating three core components. Image2Cloud reconstructs a 3D point cloud from the reference image to enable precise camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale optical flow…
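To make the point-cloud idea concrete, here is a minimal sketch of why a 3D point cloud gives direct control over camera motion: once pixels are lifted to 3D, any new camera pose can be rendered by reprojection. This is not the paper's Image2Cloud implementation. The pinhole camera model, the depth map input, and the function names `unproject` and `project` are all assumptions made for illustration.

```python
import numpy as np

def unproject(depth, K):
    """Lift an (H, W) depth map to an (H*W, 3) point cloud in camera coordinates.

    Assumes a pinhole camera with 3x3 intrinsics K (hypothetical setup).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Homogeneous pixel coordinates, one row per pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # back-projected rays at unit depth
    return rays * depth.reshape(-1, 1)       # scale each ray by its depth

def project(points, K, R, t):
    """Project 3D points into a camera with rotation R and translation t.

    Returns (N, 2) pixel coordinates and (N,) depths in the new view.
    """
    cam = points @ R.T + t                   # move points into the new camera frame
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

# Toy example: a flat scene 2 units away, viewed by a 4x4 camera.
depth = np.full((4, 4), 2.0)
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
points = unproject(depth, K)
uv_same, _ = project(points, K, np.eye(3), np.zeros(3))
uv_moved, _ = project(points, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(uv_moved[0] - uv_same[0])              # horizontal shift caused by the camera move
```

In this toy setup every pixel shifts by the same amount because the scene is flat; with real depth, nearer points shift more than distant ones, which is the parallax a user-specified camera trajectory would need to reproduce.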

Taxonomy

Topics: Advanced Vision and Imaging · CCD and CMOS Imaging Sensors · Computer Graphics and Visualization Techniques

Methods: Attention Is All You Need · ADaptive gradient method with the OPTimal convergence rate · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing