Video Diffusion Transformers are In-Context Learners

Zhengcong Fei; Di Qiu; Debang Li; Changqian Yu; Mingyuan Fan

arXiv:2412.10783·cs.CV·March 25, 2025

TL;DR

This paper introduces a simple in-context learning pipeline for video diffusion transformers that enables multi-scene video generation exceeding 30 seconds, using only light task-specific fine-tuning on small curated datasets and no modifications to the original model.

Contribution

It proposes a straightforward method to enable in-context capabilities in video diffusion transformers, allowing for controllable, multi-scene video generation with minimal tuning.

Findings

01. Effective in-context generation of multi-scene videos over 30 seconds

02. High-fidelity videos that align well with prompts and maintain role consistency

03. No modifications needed for existing models

Abstract

This paper investigates a solution for enabling the in-context capabilities of video diffusion transformers, with minimal tuning required for activation. Specifically, we propose a simple pipeline to leverage in-context generation: (i) concatenate videos along the spatial or temporal dimension, (ii) jointly caption multi-scene video clips from one source, and (iii) apply task-specific fine-tuning using carefully curated small datasets. Through a series of diverse controllable tasks, we demonstrate qualitatively that existing advanced text-to-video models can effectively perform in-context generation. Notably, it allows for the creation of consistent multi-scene videos exceeding 30 seconds in duration, without additional computational overhead. Importantly, this method requires no modifications to the original models, results in high-fidelity video outputs that…
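Steps (i) and (ii) of the pipeline can be illustrated with a minimal sketch. It is not the authors' code: videos are modelled as nested Python lists instead of tensors so that it runs without dependencies, the helper names (`make_clip`, `concat_time`, `concat_width`, `joint_caption`) are hypothetical, and the `[SCENE-k]` tag format is an assumption made for illustration. Step (iii), fine-tuning, is out of scope here.

```python
from typing import List

# Toy representation so the sketch has no dependencies: a video is a list of
# frames, a frame is a list of rows, a row is a list of pixel values.
Frame = List[List[int]]
Video = List[Frame]


def make_clip(num_frames: int, height: int, width: int, value: int) -> Video:
    """Build a constant-valued clip, standing in for decoded video frames."""
    return [[[value] * width for _ in range(height)] for _ in range(num_frames)]


def concat_time(clips: List[Video]) -> Video:
    """Step (i), temporal variant: play the clips back to back."""
    height, width = len(clips[0][0]), len(clips[0][0][0])
    for clip in clips:
        for frame in clip:
            if len(frame) != height or any(len(row) != width for row in frame):
                raise ValueError("all clips must share the same frame size")
    return [frame for clip in clips for frame in clip]


def concat_width(clips: List[Video]) -> Video:
    """Step (i), spatial variant: place the clips side by side in each frame."""
    num_frames, height = len(clips[0]), len(clips[0][0])
    for clip in clips:
        if len(clip) != num_frames or any(len(f) != height for f in clip):
            raise ValueError("all clips must share frame count and height")
    out: Video = []
    for t in range(num_frames):
        frame: Frame = []
        for y in range(height):
            row: List[int] = []
            for clip in clips:
                row.extend(clip[t][y])  # append this clip's pixels to the row
            frame.append(row)
        out.append(frame)
    return out


def joint_caption(summary: str, scene_captions: List[str]) -> str:
    """Step (ii): one prompt covering all scenes. The [SCENE-k] tag is hypothetical."""
    tagged = [f"[SCENE-{k}] {text}" for k, text in enumerate(scene_captions, start=1)]
    return " ".join([summary] + tagged)


scene_a = make_clip(num_frames=2, height=2, width=3, value=0)
scene_b = make_clip(num_frames=2, height=2, width=3, value=1)

long_video = concat_time([scene_a, scene_b])     # 4 frames of size 2 x 3
side_by_side = concat_width([scene_a, scene_b])  # 2 frames of size 2 x 6
prompt = joint_caption(
    "Two scenes featuring the same character.",
    ["She walks into a cafe.", "She sits by the window."],
)
```

With real data the same two operations would typically be a single tensor concatenation along the frame axis or the width axis; the list version only makes the shape bookkeeping explicit.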

Code & Models

Repositories

feizc/video-in-context (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications

Methods: Diffusion · ALIGN