CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

Yuelei Wang; Jian Zhang; Pengtao Jiang; Hao Zhang; Jinwei Chen; Bo Li

arXiv:2412.01429 · cs.CV · December 3, 2024

Open Access

TL;DR

CPA introduces a unified diffusion transformer framework that incorporates camera pose awareness, enabling more controllable and realistic long video generation with improved trajectory and object consistency.

Contribution

The paper proposes a plug-in architecture built on two modules, Sparse Motion Encoding (SME) and Temporal Attention Injection (TAI), to integrate camera pose and motion control into diffusion-based video generation.

Findings

01. Outperforms LDM-based methods in long video generation

02. Achieves superior trajectory and object consistency

03. Demonstrates flexible camera pose and object movement handling

Abstract

Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap with controllable camera pose perspectives. Existing works such as OpenSora do not adhere precisely to anticipated trajectories and physical interactions, thereby limiting the flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of…
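The injection idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the flattened 12-value pose layout, the linear pose projection, the scalar `gate`, and the identity Q/K/V attention are all assumptions made for illustration. It shows one plausible way a per-frame camera pose embedding could be added to frame tokens before temporal attention, with a zero gate leaving the original attention output unchanged.

```python
import math

def project_pose(pose, weight):
    """Map one flattened camera pose (length P) to a D-dim embedding
    using a P x D weight matrix (hypothetical stand-in for a pose encoder)."""
    dim = len(weight[0])
    return [sum(pose[p] * weight[p][d] for p in range(len(pose)))
            for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(tokens):
    """Single-head self-attention across frames.
    Q, K, V are the tokens themselves (identity projections, an assumption)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k))
                           for k in tokens])
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out

def inject_and_attend(tokens, poses, weight, gate):
    """Add gated pose embeddings to per-frame tokens, then attend over time.
    With gate == 0.0 the result equals plain temporal attention."""
    injected = []
    for tok, pose in zip(tokens, poses):
        emb = project_pose(pose, weight)
        injected.append([t + gate * e for t, e in zip(tok, emb)])
    return temporal_attention(injected)

# Three frames, one token per frame, feature dimension 2 (toy values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical flattened 3x4 extrinsics: identity rotation, camera moving along x.
poses = [[1, 0, 0, 0.1 * t, 0, 1, 0, 0.0, 0, 0, 1, 0.0] for t in range(3)]
# Hypothetical fixed 12 x 2 projection weights.
weight = [[0.1 * (p + 1), -0.05 * p] for p in range(12)]

baseline = temporal_attention(tokens)
conditioned = inject_and_attend(tokens, poses, weight, gate=1.0)
```

The gate is one common way to let an added conditioning path start as a no-op on top of pretrained weights; whether CPA uses such a gate is not stated in the abstract.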

Taxonomy

Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Absolute Position Encodings · Residual Connection · Adam · Softmax · Label Smoothing · Dropout · Dense Connections · Layer Normalization · Diffusion