CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers
D. She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang, Yunlong Yu, Siming Fu

TL;DR
CustomVideoX is a novel framework for personalized video generation that uses 3D reference attention and dynamic bias strategies to improve consistency and quality, leveraging pre-trained video diffusion transformers.
Contribution
It introduces three components for personalized video synthesis: 3D Reference Attention, Time-Aware Reference Attention Bias (TAB), and Entity Region-Aware Enhancement (ERAE). It also introduces a new benchmark for evaluation.
Findings
Outperforms existing methods in video consistency and quality
Efficiently utilizes pre-trained models by training only LoRA parameters
Establishes VideoBench, a new benchmark for personalized video generation
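The LoRA-only training mentioned in the findings can be illustrated with a minimal NumPy sketch. This is a generic low-rank adapter, not the paper's implementation: the class name `LoRALinear`, the rank, the scaling factor, and the layer size are all illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (hypothetical helper)."""

    def __init__(self, w, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = w.shape
        self.w = w                                      # frozen pre-trained weight
        self.a = rng.normal(0.0, 0.01, (rank, in_dim))  # trainable down-projection
        self.b = np.zeros((out_dim, rank))              # trainable up-projection, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T

    def trainable_params(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))   # stand-in for one pre-trained projection matrix
layer = LoRALinear(w, rank=4)
x = rng.normal(size=(2, 64))
y = layer(x)
print(layer.trainable_params(), "trainable vs", w.size, "frozen")
```

Because the up-projection starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small matrices would receive gradient updates.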
Abstract
Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content…
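One way to read the 3D Reference Attention described in the abstract is as a single attention pass in which every video token, from every frame and spatial position, attends to all video tokens and all reference-image tokens at once. The sketch below shows that reading in NumPy under stated assumptions: the function name `reference_attention_3d`, the single-head form, the projection matrices, and the tensor sizes are hypothetical, and the paper's actual design may differ. The bias mechanisms that the abstract goes on to mention are not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention_3d(video, ref, wq, wk, wv):
    """video: (T, H, W, C) latent frames; ref: (Hr, Wr, C) reference features.

    Returns the attended video tokens, shaped (T, H, W, C), and the attention map.
    """
    t, h, w, c = video.shape
    vid_tokens = video.reshape(t * h * w, c)   # flatten time and space together
    ref_tokens = ref.reshape(-1, c)
    # Keys and values cover video tokens and reference tokens jointly.
    kv_in = np.concatenate([vid_tokens, ref_tokens], axis=0)
    q = vid_tokens @ wq
    k = kv_in @ wk
    v = kv_in @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = (attn @ v).reshape(t, h, w, v.shape[-1])
    return out, attn

rng = np.random.default_rng(0)
T, H, W, C = 3, 4, 4, 8
video = rng.normal(size=(T, H, W, C))
ref = rng.normal(size=(4, 4, C))
wq, wk, wv = (rng.normal(size=(C, C)) for _ in range(3))
out, attn = reference_attention_3d(video, ref, wq, wk, wv)
print(out.shape, attn.shape)
```

The last 16 columns of `attn` hold the weights each video token places on the reference image, which is where a reference-specific bias could in principle be applied.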
Taxonomy
Topics: Image and Video Quality Assessment · Advanced Vision and Imaging · Advanced Image Processing Techniques
Methods: Softmax · Attention Is All You Need · Diffusion
