Unified Camera Positional Encoding for Controlled Video Generation
Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai

TL;DR
This paper introduces UCPE, a unified camera encoding method that captures complete camera information, enabling advanced controllability in video generation with minimal additional training parameters.
Contribution
We propose Relative Ray Encoding and Absolute Orientation Encoding to create UCPE, a comprehensive camera representation that improves controllability and generalization in transformer-based video generation.
Findings
Achieves state-of-the-art camera controllability and visual fidelity.
Supports diverse camera intrinsics and lens distortions.
Requires less than 1% additional trainable parameters.
Abstract
Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control…
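To make the idea of a ray-based camera representation concrete, the following is a minimal sketch, not the paper's Relative Ray Encoding. It assumes an ideal pinhole camera without lens distortion, an OpenCV-style camera frame (x right, y down, z forward), and camera-to-world rotation matrices. The function names `pixel_ray` and `ray_in_reference_frame` are hypothetical and chosen only for illustration. The sketch shows how a pixel maps to a unit ray direction and how that ray can be expressed in the frame of a reference camera.

```python
import math

def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit ray direction in the camera frame for pixel (u, v).

    Assumes an ideal pinhole model (no lens distortion); fx, fy are focal
    lengths in pixels and (cx, cy) is the principal point.
    """
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)

def transpose(R):
    """Transpose of a 3x3 matrix given as nested lists."""
    return [[R[j][i] for j in range(3)] for i in range(3)]

def matmul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotate(R, d):
    """Apply a 3x3 rotation matrix R to a 3-vector d."""
    return tuple(sum(R[i][j] * d[j] for j in range(3)) for i in range(3))

def ray_in_reference_frame(d, R_ref, R_cam):
    """Express a camera-frame ray d in the reference camera's frame.

    R_ref and R_cam are assumed to be camera-to-world rotations, so the
    relative rotation is R_ref^T @ R_cam.
    """
    R_rel = matmul(transpose(R_ref), R_cam)
    return rotate(R_rel, d)

# Usage: the principal-point ray of a camera yawed 90 degrees about the
# y axis, expressed in the frame of an unrotated reference camera.
theta = math.pi / 2
R_ref = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_cam = [[math.cos(theta), 0.0, math.sin(theta)],
         [0.0, 1.0, 0.0],
         [-math.sin(theta), 0.0, math.cos(theta)]]
center_ray = pixel_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0)
relative_ray = ray_in_reference_frame(center_ray, R_ref, R_cam)
```

Because every pixel is described by a ray rather than by pinhole parameters, a representation of this kind could in principle accommodate other projection models by swapping out `pixel_ray`; how UCPE handles intrinsics and distortion in detail is not specified in this summary.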
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Face recognition and analysis
