CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers

D. She; Mushui Liu; Jingxuan Pang; Jin Wang; Zhen Yang; Wanggui He; Guanghao Zhang; Yi Wang; Qihan Huang; Haobin Tang; Yunlong Yu; Siming Fu

arXiv:2502.06527·cs.CV·February 21, 2025

Open Access

TL;DR

CustomVideoX is a novel framework for personalized video generation that uses 3D reference attention and dynamic bias strategies to improve consistency and quality, leveraging pre-trained video diffusion transformers.

Contribution

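The paper introduces three components for personalized video synthesis: 3D Reference Attention, a Time-Aware Reference Attention Bias (TAB) strategy, and an Entity Region-Aware Enhancement (ERAE) module. It also contributes a new benchmark for evaluation.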

Findings

1. Outperforms existing methods in video consistency and quality
2. Efficiently utilizes pre-trained models by training only LoRA parameters
3. Establishes VideoBench, a new benchmark for personalized video generation
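
The second finding, training only LoRA parameters on top of a frozen pre-trained model, can be illustrated with a minimal NumPy sketch of a single low-rank adapted linear layer. This is not the paper's implementation: the function name `lora_linear`, the layer sizes, the rank, and the zero initialization of one factor are all assumptions chosen for illustration.

```python
import numpy as np

def lora_linear(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Frozen linear layer plus a low-rank update: y = x W^T + scale * (x A^T) B^T.

    Hypothetical helper for illustration; only lora_a and lora_b would be trained.
    """
    return x @ w_frozen.T + scale * (x @ lora_a.T) @ lora_b.T

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 6, 2                      # illustrative sizes (assumed)

w_frozen = rng.normal(size=(d_out, d_in))        # pre-trained weight, kept fixed
lora_a = rng.normal(size=(rank, d_in)) * 0.01    # trainable down-projection
lora_b = np.zeros((d_out, rank))                 # trainable up-projection, zero init (assumed)

x = rng.normal(size=(4, d_in))
y = lora_linear(x, w_frozen, lora_a, lora_b)

# Parameter counts: the adapter is much smaller than the frozen weight.
n_frozen = w_frozen.size
n_trainable = lora_a.size + lora_b.size
```

With the up-projection initialized to zero, the adapted layer starts out identical to the frozen layer, and the number of trainable values grows with `rank * (d_in + d_out)` rather than `d_in * d_out`.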

Abstract

Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content…
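
The abstract describes 3D Reference Attention as letting reference image features interact with all video frames at once across space and time. One way to picture that, as a rough sketch only, is to flatten the video tokens over frames and spatial positions, place the reference tokens in the same sequence, and run a single softmax attention over the joint sequence. The function `joint_reference_attention`, the identity query/key/value projections, and the tensor sizes below are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_reference_attention(video_tokens, ref_tokens):
    """Attend over reference tokens and all video tokens in one sequence.

    video_tokens: (frames, tokens_per_frame, dim)
    ref_tokens:   (ref_tokens, dim)
    Hypothetical helper; projections are the identity for brevity (assumption).
    """
    t, n, d = video_tokens.shape
    video_seq = video_tokens.reshape(t * n, d)             # flatten time and space
    seq = np.concatenate([ref_tokens, video_seq], axis=0)  # reference tokens first
    q = k = v = seq
    weights = softmax(q @ k.T / np.sqrt(d))                # (L, L) over the joint sequence
    out = weights @ v
    n_ref = ref_tokens.shape[0]
    return out[n_ref:].reshape(t, n, d), weights

rng = np.random.default_rng(0)
video = rng.normal(size=(3, 4, 8))   # 3 frames, 4 tokens per frame, dim 8 (assumed sizes)
ref = rng.normal(size=(4, 8))        # 4 reference-image tokens
out, weights = joint_reference_attention(video, ref)
```

In this sketch every video token, in every frame, places some attention weight on every reference token, which is the "direct and simultaneous" interaction the abstract refers to. The bias terms the abstract goes on to mention are not modeled here.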

Taxonomy

Topics: Image and Video Quality Assessment · Advanced Vision and Imaging · Advanced Image Processing Techniques

Methods: Softmax · Attention Is All You Need · Diffusion