Evolution of Video Generative Foundations

Teng Hu; Jiangning Zhang; Hongrui Huang; Ran Yi; Zihan Su; Jieyu Weng; Zhucun Xue; Lizhuang Ma; Ming-Hsuan Yang; Dacheng Tao

arXiv:2604.06339·cs.CV·April 9, 2026

TL;DR

This survey comprehensively reviews the evolution of video generation technologies, from early GANs to diffusion and autoregressive models, highlighting trends, challenges, and future directions in multimodal and contextual video synthesis.

Contribution

The survey presents itself as the first systematic overview of the development of video generation, covering foundational principles, key advancements, and the integration of multimodal data.

Findings

01 Diffusion models now dominate video generation.

02 Emerging autoregressive and multimodal techniques enhance contextual understanding.

03 Historical trends inform future research directions.

Abstract

The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems ranging from proprietary pioneers like OpenAI's Sora, Google's Veo3, and ByteDance's Seedance to powerful open-source contenders like Wan and HunyuanVideo to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building "world models" that simulate real-world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews on video generation often focus on narrow technical fields (e.g., Generative Adversarial Networks (GANs) and diffusion models) or specific tasks (e.g., video editing), and lack a comprehensive perspective on the field's evolution, especially regarding Auto-Regressive (AR) models and the integration of multimodal information. To address these gaps, this survey…

Code & Models

Repositories

sjtuplayer/Awesome-Video-Foundations (GitHub)
