DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Minghong Cai; Xiaodong Cun; Xiaoyu Li; Wenze Liu; Zhaoyang Zhang; Yong Zhang; Ying Shan; Xiangyu Yue

arXiv:2412.18597·cs.CV·March 27, 2025

TL;DR

DiTCtrl introduces a training-free multi-prompt video generation method using attention control in MM-DiT architectures, enabling coherent, smooth transitions across multiple prompts without additional training.

Contribution

The paper presents the first training-free multi-prompt video generation approach that leverages attention sharing in MM-DiT models, addressing the strict training-data requirements and unnatural transitions of earlier approaches.
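To make the term "attention sharing" concrete, here is a minimal sketch of one common form used in tuning-free editing: the queries of a target segment attend over the keys and values of a source segment as well as their own. This is an assumption for illustration, not the paper's confirmed formulation; the function names (`attention`, `shared_attention`) and the toy list-of-vectors representation are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out

def shared_attention(q_tgt, k_tgt, v_tgt, k_src, v_src):
    """Hypothetical sharing scheme: target queries attend over the
    concatenation of source and target keys/values, so source content
    can flow into the target segment without any training."""
    return attention(q_tgt, k_src + k_tgt, v_src + v_tgt)

# Toy usage: two target tokens, one source token, one target token.
q_tgt = [[1.0, 0.0], [0.0, 1.0]]
k_tgt, v_tgt = [[1.0, 0.0]], [[0.0, 0.0]]
k_src, v_src = [[1.0, 0.0]], [[2.0, 4.0]]
out = shared_attention(q_tgt, k_tgt, v_tgt, k_src, v_src)
```

Because the source and target keys are identical in this toy case, each target token averages the source and target values equally; in a real model the attention weights would decide how much source content each token borrows.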

Findings

01. Achieves smooth multi-prompt video transitions

02. Demonstrates state-of-the-art performance without extra training

03. Introduces MPVBench benchmark for evaluation

Abstract

Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation and struggle to generate coherent scenes from multiple sequential prompts, which better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method for MM-DiT architectures. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full…

Code & Models

Repositories

tencentarc/ditctrl (PyTorch, official)

Taxonomy

Topics: Image and Video Quality Assessment · Neural Networks and Reservoir Computing · CCD and CMOS Imaging Sensors

Methods: Linear Layer · Dense Connections · Residual Connection · Adam · Diffusion · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Dropout