Image-to-Video Diffusion: From Foundations to Open Frontiers
Xianlong Wang, Wenbo Pan, Shijia Zhou, Ke Li, Yuqi Wang, Zeyu Ye, Hangtao Zhang, Leo Yu Zhang, Xiaohua Jia

TL;DR
This paper systematically reviews diffusion-based image-to-video generation, proposing a taxonomy and analyzing core design choices, challenges, and future directions in this emerging field.
Contribution
It provides the first dedicated taxonomy and systematic analysis of diffusion I2V generation, highlighting key design principles and open challenges.
Findings
Organized existing methods into a taxonomy based on architecture and training paradigm.
Identified four core design components: condition encoding, temporal modeling, noise prior, and spatial-temporal upsampling.
Discussed major open challenges and application scenarios in diffusion I2V generation.
Abstract
Diffusion-based image-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature grows rapidly, existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy together with a systematic analysis centered on this field. This work addresses that gap by treating diffusion I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition…
