DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue

TL;DR
DiTCtrl is a training-free multi-prompt video generation method that applies attention control in MM-DiT architectures to produce coherent videos with smooth transitions between prompts.
Contribution
The paper presents what the authors describe as the first training-free multi-prompt video generation approach for MM-DiT models. It relies on attention sharing and targets the strict training data requirements and unnatural transitions of earlier multi-prompt methods.
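As a rough illustration of what "attention sharing" generally means in attention-control methods, the sketch below lets the queries of one video segment attend to the keys and values of another segment. This is a generic, simplified example and not the paper's actual procedure: the function names (`attention`, `shared_kv_attention`), the token counts, and the random inputs are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for arrays shaped (tokens, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def shared_kv_attention(q_target, k_source, v_source):
    """Hypothetical helper: target-segment queries read source-segment keys/values."""
    return attention(q_target, k_source, v_source)

# Toy data standing in for the tokens of two prompt segments (assumed shapes).
rng = np.random.default_rng(0)
tokens, dim = 6, 8
q_src, k_src, v_src = (rng.standard_normal((tokens, dim)) for _ in range(3))
q_tgt, k_tgt, v_tgt = (rng.standard_normal((tokens, dim)) for _ in range(3))

# Independent generation: the target segment only sees its own keys/values.
independent = attention(q_tgt, k_tgt, v_tgt)

# Shared generation: the target segment reuses the source segment's keys/values,
# so every output row is a weighted average of source value vectors.
shared = shared_kv_attention(q_tgt, k_src, v_src)
```

Because each row of the softmax weights sums to one, every row of `shared` is a convex combination of rows of `v_src`, which is why content from the source segment carries over into the target segment in this kind of scheme.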
Findings
Achieves smooth multi-prompt video transitions
Demonstrates state-of-the-art performance without extra training
Introduces MPVBench benchmark for evaluation
Abstract
Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation and struggle to generate coherent scenes from multiple sequential prompts, which better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. Our key idea is to treat the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full…
Taxonomy
Topics: Image and Video Quality Assessment · Neural Networks and Reservoir Computing · CCD and CMOS Imaging Sensors
Methods: Linear Layer · Dense Connections · Residual Connection · Adam · Diffusion · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Dropout
