Versatile Transition Generation with Image-to-Video Diffusion

Zuhao Yang; Jiahui Zhang; Yingchen Yu; Shijian Lu; Song Bai

arXiv:2508.01698 · cs.CV · August 5, 2025

TL;DR

This paper introduces VTG, a versatile framework for generating smooth, high-quality video transitions from various inputs, addressing a gap in existing diffusion-based video generation methods.

Contribution

The paper proposes VTG, a novel transition generation framework with interpolation initialization and dual-directional fine-tuning, improving motion smoothness and fidelity in video transitions.

Findings

1. VTG outperforms existing methods on TransitBench across multiple tasks.

2. Interpolation-based initialization preserves object identity effectively.

3. Dual-directional motion fine-tuning enhances transition smoothness.

Abstract

Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high-quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive text prompts remains largely underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization to mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate…
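
The abstract names interpolation-based initialization but this page does not say which interpolation scheme VTG uses. As a rough illustration only, the sketch below assumes spherical linear interpolation (slerp) between two endpoint latents, a common choice for blending Gaussian noise vectors. The names `slerp` and `interpolate_latents` are hypothetical, and latents are modeled as flat lists of floats rather than the tensors a real image-to-video pipeline would use.

```python
import math
from typing import List

Vector = List[float]


def slerp(z0: Vector, z1: Vector, t: float) -> Vector:
    """Spherical linear interpolation between two equal-length vectors.

    Assumption: this is an illustrative stand-in, not the paper's method.
    """
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))

    # Fall back to plain linear interpolation when the angle is undefined
    # (zero-length vector) or numerically unstable (near 0 or pi).
    if n0 == 0.0 or n1 == 0.0:
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


def interpolate_latents(z_first: Vector, z_last: Vector, num_frames: int) -> List[Vector]:
    """Build one initial latent per frame, from the first-frame latent to the last."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    return [
        slerp(z_first, z_last, i / (num_frames - 1))
        for i in range(num_frames)
    ]


# Hypothetical usage: two toy 2-D "latents" and a three-frame transition.
frames = interpolate_latents([1.0, 0.0], [0.0, 1.0], 3)
```

Under this assumption the first and last entries of `frames` reproduce the endpoint latents, and the intermediate entries move along the arc between them, so each frame starts denoising from a state that already blends the two endpoints.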

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications