CPA: Camera-pose-awareness Diffusion Transformer for Video Generation
Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li

TL;DR
CPA introduces a unified diffusion transformer framework that incorporates camera pose awareness, enabling more controllable and realistic long video generation with improved trajectory and object consistency.
Contribution
It proposes a plug-in architecture with Sparse Motion Encoding (SME) and Temporal Attention Injection (TAI) modules that integrates camera pose and motion control into diffusion-based video generation.
Findings
Outperforms LDM-based methods in long video generation
Achieves superior trajectory and object consistency
Demonstrates flexible camera pose and object movement handling
Abstract
Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap in controllable camera pose perspectives. Existing works such as OpenSora do not adhere precisely to anticipated trajectories and physical interactions, which limits their flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of…
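The abstract does not say how SME turns a camera pose into a spatial-temporal embedding. As an illustration only, the sketch below uses Plücker ray coordinates, a common way to express a camera pose as a per-pixel 6-D map; it is an assumed stand-in, not the paper's SME. The function names, the pinhole intrinsics (fx, fy, cx, cy), and the camera-to-world pose convention are all hypothetical choices made for this example.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def mat_vec(m, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))

def plucker_embedding(rotation_c2w, center, fx, fy, cx, cy, height, width):
    """Per-pixel Plücker vectors (o x d, d) for one frame.

    rotation_c2w: assumed 3x3 camera-to-world rotation.
    center: assumed camera center o in world coordinates.
    """
    grid = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel center through a pinhole camera.
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            d_world = normalize(mat_vec(rotation_c2w, d_cam))
            moment = cross(center, d_world)
            row.append(moment + d_world)  # 6-D tuple per pixel
        grid.append(row)
    return grid

def pose_sequence_embedding(poses, fx, fy, cx, cy, height, width):
    """Stack per-frame maps into a frames x height x width x 6 structure."""
    return [plucker_embedding(r, c, fx, fy, cx, cy, height, width)
            for r, c in poses]

# Two frames: the camera starts at the origin, then shifts along x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = [(identity, (0.0, 0.0, 0.0)), (identity, (0.1, 0.0, 0.0))]
emb = pose_sequence_embedding(poses, fx=2.0, fy=2.0, cx=1.5, cy=1.5,
                              height=3, width=3)
```

In a setup like the one the abstract describes, a map of this kind would be patchified and projected to the token width before being injected into temporal attention; that step is not shown here, since the abstract gives no details about it.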
Taxonomy
Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Absolute Position Encodings · Residual Connection · Adam · Softmax · Label Smoothing · Dropout · Dense Connections · Layer Normalization · Diffusion
