EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal

TL;DR
EPiC introduces an efficient framework for 3D camera control in video diffusion models by automatically generating high-quality anchor videos without extensive annotations, enhancing accuracy and resource efficiency.
Contribution
The paper proposes a novel method to automatically create precise anchor videos and a lightweight conditioning module, improving camera control learning without requiring camera trajectory annotations.
Findings
Achieves state-of-the-art results on RealEstate10K and MiraData datasets.
Demonstrates robust zero-shot generalization to video-to-video scenarios.
Reduces training parameters and steps significantly compared to prior methods.
Abstract
Recent approaches on 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds following annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and thus can be readily applied to any…
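The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a per-frame boolean visibility mask is already available (how that mask is estimated is not stated in the text above), and the names `make_anchor_video`, `visibility`, and `fill_value` are hypothetical. The sketch only shows how a source video and such a mask combine into an anchor video that stays pixel-aligned with the source.

```python
import numpy as np


def make_anchor_video(video, visibility, fill_value=0):
    """Blank out pixels that are not marked as visible from the first frame.

    video:      array of shape (T, H, W, C), the source clip.
    visibility: boolean array of shape (T, H, W); True where a pixel shows
                content that was already visible in frame 0 (ASSUMED to be
                given; its estimation is outside this sketch).
    fill_value: value written into masked-out pixels (hypothetical choice).
    """
    video = np.asarray(video)
    visibility = np.asarray(visibility, dtype=bool)
    if video.ndim != 4 or visibility.shape != video.shape[:3]:
        raise ValueError("expected video (T,H,W,C) and visibility (T,H,W)")
    # Broadcast the mask over the channel axis and keep only visible pixels.
    anchor = np.where(visibility[..., None], video, fill_value)
    return anchor.astype(video.dtype)


# Toy example: a 2-frame, 2x2 RGB clip with one pixel hidden in frame 1.
video = np.arange(2 * 2 * 2 * 3, dtype=np.uint8).reshape(2, 2, 2, 3)
visibility = np.ones((2, 2, 2), dtype=bool)  # frame 0 is fully visible
visibility[1, 0, 1] = False                  # this pixel is newly revealed
anchor = make_anchor_video(video, visibility)
```

Because the anchor is derived from the source video itself, every unmasked pixel matches the training target exactly, which is the alignment property the abstract emphasizes.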
Peer Reviews
Decision · Submitted to ICLR 2026
1. The core strength is the visibility-masking method for creating aligned anchor videos. This elegantly bypasses the need for 3D reconstruction, enabling more efficient training in data and compute than prior work. 2. EPiC achieves SOTA results on standard I2V camera control benchmarks (RealEstate10K, MiraData), demonstrating that the efficiency gains do not compromise final quality. 3. The model shows excellent zero-shot generalization to V2V tasks and demonstrates strong capability in handlin
1. The framework operates on a relative scale determined by an external depth estimator at inference time. This may prevent users from specifying precise, real-world camera movements (e.g., "move 2 meters"). 2. While qualitative results are promising, the paper lacks quantitative validation on a benchmark with dynamic objects and ground-truth camera motion (e.g., RealCam-Vid). This would allow for a direct and fair comparison of dynamic handling capabilities with methods like RealCam-I2V.
1. The visibility-based masking for anchor video generation is conceptually simple yet effective. 2. The method achieves SOTA quality and camera control scores on both I2V and V2V test sets, and the ablation studies verify the effectiveness of the proposed methods.
1. The principal contribution lies in constructing more precisely aligned anchor videos to improve training efficiency. Is the proposed module plug-and-play, and could it likewise enhance other diffusion-based video models? 2. I recommend including a failure-case analysis to more thoroughly illustrate the method’s limitations. 3. During inference, how robust is EPiC to substantial errors in point-cloud-rendered anchors? Since it is trained with high-quality mask-based anchors, this discrepancy c
1. Clear and interpretable design. 2. The idea of converting the problem into anchor-video construction is very interesting.
1. Converting camera-guided video generation into the task of supplementing an anchor video is an interesting idea. However, when applied to I2V, although the authors preserve foreground dynamics by masking foreground regions during guidance, the background is effectively forced to remain static. This imposes a limitation when treating video generation as a world model. Moreover, the approach depends to some extent on the performance of foreground extraction models (e.g., *GroundingDINO* used in
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Model Reduction and Neural Networks
