STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin; Wei Liu; Chen Chen; Jiasen Lu; Wenze Hu; Tsu-Jui Fu; Jesse Allardice; Zhengfeng Lai; Liangchen Song; Bowen Zhang; Cha Chen; Yiran Fei; Lezhi Li; Yizhou Sun; Kai-Wei Chang; Yinfei Yang

arXiv:2412.07730 · cs.CV · October 7, 2025

TL;DR

STIV introduces a scalable, unified framework for text and image conditioned video generation using diffusion transformers, achieving state-of-the-art results across multiple tasks with a simple design.

Contribution

The paper presents a systematic study and a simple, scalable method for text-image conditioned video generation, integrating image and text conditioning into a diffusion transformer architecture.

Findings

01. Achieves 83.1 on VBench T2V, surpassing existing models.

02. Achieves 90.1 on VBench I2V, setting a new state-of-the-art.

03. Demonstrates versatility across tasks like video prediction and frame interpolation.

Abstract

The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning via a joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view…
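The abstract names two mechanisms, frame replacement and joint image-text classifier-free guidance, without spelling them out. The sketch below is one plausible reading and not the authors' implementation: it assumes the image condition is a clean latent written over the first frame of the noisy video latents, and that guidance contrasts a fully conditioned prediction (image and text) with a fully unconditioned one (neither). The names `replace_first_frame`, `joint_cfg`, `guided_prediction`, and `toy_denoiser` are hypothetical, and `toy_denoiser` merely stands in for a Diffusion Transformer.

```python
import numpy as np


def replace_first_frame(noisy_latents, image_latent):
    """Return a copy of the noisy latents with frame 0 set to the clean image latent.

    Assumption: latents are shaped (frames, channels, height, width).
    """
    out = noisy_latents.copy()
    out[0] = image_latent
    return out


def joint_cfg(pred_cond, pred_uncond, scale):
    """Standard classifier-free guidance mix of two model predictions."""
    return pred_uncond + scale * (pred_cond - pred_uncond)


def toy_denoiser(latents, text_emb):
    """Hypothetical stand-in for the DiT: any callable with this signature works."""
    base = 0.1 * latents
    if text_emb is None:
        return base  # text dropped: unconditional branch
    return base + text_emb.mean()


def guided_prediction(model, noisy_latents, image_latent, text_emb, scale):
    """One guided model evaluation under the assumptions stated above."""
    # Conditional branch: image injected by frame replacement, text supplied.
    cond_input = replace_first_frame(noisy_latents, image_latent)
    pred_cond = model(cond_input, text_emb)
    # Unconditional branch: image and text are both dropped together.
    pred_uncond = model(noisy_latents, None)
    return joint_cfg(pred_cond, pred_uncond, scale)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.normal(size=(4, 2, 3, 3))   # 4 frames of toy latents
    image = np.ones((2, 3, 3))              # clean first-frame latent
    text = np.full(8, 0.5)                  # toy text embedding
    out = guided_prediction(toy_denoiser, noisy, image, text, scale=7.5)
    print(out.shape)
```

With `scale` set to 1 the result reduces to the conditioned prediction, and with 0 to the unconditioned one. Skipping `replace_first_frame` and passing only text would correspond to the T2V case in this reading, which is how a single model could serve both tasks.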

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing