Image-to-Video Diffusion: From Foundations to Open Frontiers

Xianlong Wang; Wenbo Pan; Shijia Zhou; Ke Li; Yuqi Wang; Zeyu Ye; Hangtao Zhang; Leo Yu Zhang; Xiaohua Jia

arXiv:2605.17248 · cs.CV · May 19, 2026

TL;DR

This paper systematically reviews diffusion-based image-to-video generation, proposing a taxonomy and analyzing core design choices, challenges, and future directions in this emerging field.

Contribution

The paper provides a dedicated taxonomy and systematic analysis of diffusion-based image-to-video (I2V) generation, which the authors describe as missing from prior work, and highlights key design principles and open challenges.

Findings

01. Organized existing methods into a taxonomy based on architecture and training paradigm.

02. Identified four core design components: condition encoding, temporal modeling, noise prior, and spatial-temporal upsampling.

03. Discussed major open challenges and application scenarios in diffusion I2V generation.

Abstract

Diffusion-based image-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature grows rapidly, existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy together with a systematic analysis centered on this field. This work addresses that gap by treating diffusion I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition…
