ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

Ozgur Kara; Krishna Kumar Singh; Feng Liu; Duygu Ceylan; James M. Rehg; Tobias Hinz

arXiv:2505.07652·cs.CV·May 13, 2025

Open Access

TL;DR

ShotAdapter introduces a novel framework for generating multi-shot videos from text prompts, enabling discrete shot transitions, character consistency, and user control over shot content and timing using diffusion models.

Contribution

The paper presents a new dataset collection pipeline and architectural extensions for diffusion models to enable text-to-multi-shot video generation with shot-specific control.

Findings

01

Fine-tuning a pre-trained model enables multi-shot video generation.

02

The approach outperforms existing baselines.

03

The method ensures character and background consistency across shots.

Abstract

Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation, we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins and a local…
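
The user control over the number and duration of shots described in the abstract can be illustrated with a small sketch. It is not ShotAdapter's implementation: the helper name `shot_layout` and the frame counts are hypothetical, and the code only shows how per-shot frame counts could be turned into the frame indices at which a new shot begins, which are the positions a transition marker would have to flag.

```python
from itertools import accumulate


def shot_layout(shot_lengths):
    """Map per-shot frame counts to a per-frame shot index and shot start frames.

    Hypothetical helper for illustration only; the paper's actual
    conditioning mechanism is not reproduced here.
    """
    if not shot_lengths or any(n <= 0 for n in shot_lengths):
        raise ValueError("each shot needs a positive frame count")
    # First frame of each shot: 0, then the running total of the earlier shots.
    starts = [0] + list(accumulate(shot_lengths))[:-1]
    # Shot index of every frame in the concatenated video.
    shot_ids = [i for i, n in enumerate(shot_lengths) for _ in range(n)]
    return shot_ids, starts


# Assumed example: three shots of 3, 2, and 4 frames in one 9-frame video.
shot_ids, starts = shot_layout([3, 2, 4])
print(shot_ids)
print(starts)
```

Keeping all shots in one frame sequence, rather than generating them separately, matches the abstract's framing of a multi-shot video as a single video attended over jointly.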

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Diffusion