Video Diffusion Transformers are In-Context Learners
Zhengcong Fei, Di Qiu, Debang Li, Changqian Yu, Mingyuan Fan

TL;DR
This paper introduces a simple in-context learning pipeline for video diffusion transformers that enables effective multi-scene video generation exceeding 30 seconds, with only minimal task-specific fine-tuning and no modifications to the underlying model.
Contribution
It proposes a straightforward method to enable in-context capabilities in video diffusion transformers, allowing for controllable, multi-scene video generation with minimal tuning.
Findings
Effective in-context generation of multi-scene videos over 30 seconds
High-fidelity videos that align well with prompts and maintain role consistency
No modifications needed for existing models
Abstract
This paper investigates a solution for enabling in-context capabilities of video diffusion transformers, with minimal tuning required for activation. Specifically, we propose a simple pipeline to leverage in-context generation: (1) concatenate videos along the spatial or time dimension, (2) jointly caption multi-scene video clips from one source, and (3) apply task-specific fine-tuning using carefully curated small datasets. Through a series of diverse controllable tasks, we demonstrate qualitatively that existing advanced text-to-video models can effectively perform in-context generation. Notably, it allows for the creation of consistent multi-scene videos exceeding 30 seconds in duration, without additional computational overhead. Importantly, this method requires no modifications to the original models, results in high-fidelity video outputs that…
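The first two pipeline steps in the abstract (concatenating clips along the spatial or time dimension, and jointly captioning scenes from one source) can be illustrated with a minimal sketch. This is not the authors' code: the function names `concat_clips` and `joint_caption`, the `(frames, height, width, channels)` array layout, and the `[SCENE-n]` tag format are all assumptions made here for illustration only.

```python
import numpy as np

# Hypothetical layout: each clip is an array of shape
# (frames, height, width, channels). The paper summary does not
# specify a layout; this one is assumed for the example.
_AXES = {"time": 0, "height": 1, "width": 2}


def concat_clips(clips, axis="time"):
    """Join clips along the time axis or one of the spatial axes."""
    if axis not in _AXES:
        raise ValueError(f"axis must be one of {sorted(_AXES)}")
    if not clips:
        raise ValueError("need at least one clip")
    dim = _AXES[axis]
    ref = clips[0].shape
    for clip in clips:
        if clip.ndim != 4:
            raise ValueError("each clip must have 4 dimensions")
        # Every dimension except the one being joined must match.
        if any(clip.shape[i] != ref[i] for i in range(4) if i != dim):
            raise ValueError("clip shapes differ outside the concat axis")
    return np.concatenate(clips, axis=dim)


def joint_caption(scene_captions):
    """Merge per-scene captions into one prompt (tag format is assumed)."""
    parts = [
        f"[SCENE-{i}] {text.strip()}"
        for i, text in enumerate(scene_captions, start=1)
    ]
    return " ".join(parts)


clip_a = np.zeros((8, 16, 24, 3), dtype=np.uint8)
clip_b = np.ones((8, 16, 24, 3), dtype=np.uint8)

sequential = concat_clips([clip_a, clip_b], axis="time")    # scenes back to back
side_by_side = concat_clips([clip_a, clip_b], axis="width")  # scenes in one frame
prompt = joint_caption(["A man walks into a cafe.", "He sits by the window."])
```

Joining along time yields one longer clip whose scenes play in sequence, while joining along width places the scenes next to each other in every frame. The third step, task-specific fine-tuning on a small curated dataset, depends on the specific text-to-video model and is not sketched here.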
Taxonomy
Topics: Neural Networks and Applications
Methods: Diffusion · ALIGN
