MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling
Bowen Zhang, Xiaofei Xie, Haotian Lu, Na Ma, Tianlin Li, Qing Guo

TL;DR
MAVIN is a diffusion-based model that generates seamless transition videos between two given segments, addressing challenges in multi-action video generation by focusing on smoothness, coherence, and long-term consistency.
Contribution
MAVIN introduces boundary frame guidance and a Gaussian filter mixer for transition video infilling, along with a new metric for evaluating temporal smoothness.
Findings
MAVIN outperforms existing methods in generating smooth transition videos.
The model effectively handles large infilling gaps and varied transition lengths.
Experimental results demonstrate superior temporal coherence and visual quality.
Abstract
Diffusion-based video generation has achieved significant progress, yet generating multiple actions that occur sequentially remains a formidable task. Directly generating a video with sequential actions can be extremely challenging due to the scarcity of fine-grained action annotations and the difficulty in establishing temporal semantic correspondences and maintaining long-term consistency. To tackle this, we propose an intuitive and straightforward solution: splicing multiple single-action video segments sequentially. The core challenge lies in generating smooth and natural transitions between these segments given the inherent complexity and variability of action transitions. We introduce MAVIN (Multi-Action Video INfilling model), designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence. MAVIN incorporates several…
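The splicing idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `placeholder_transition` and `splice_segments` are hypothetical, videos are assumed to be NumPy arrays of shape (frames, height, width, channels), and the transition is a plain linear cross-fade between the two boundary frames. The cross-fade is only a stand-in marking where a learned infilling model such as MAVIN would be called; it is not the paper's method.

```python
import numpy as np

def placeholder_transition(last_frame, first_frame, num_frames):
    """Hypothetical stand-in for a learned infilling model: a linear cross-fade.

    This is NOT MAVIN's method; it only marks where a transition model would run.
    """
    # Interior weights only, so the two boundary frames are not duplicated.
    weights = np.linspace(0.0, 1.0, num_frames + 2)[1:-1]
    return np.stack([(1.0 - w) * last_frame + w * first_frame for w in weights])

def splice_segments(segments, num_transition_frames):
    """Concatenate segments along the time axis with a transition between each pair."""
    pieces = [segments[0]]
    for prev, nxt in zip(segments, segments[1:]):
        # Condition the transition on the frames that border the gap.
        pieces.append(placeholder_transition(prev[-1], nxt[0], num_transition_frames))
        pieces.append(nxt)
    return np.concatenate(pieces, axis=0)

# Two toy single-action "videos": 8 frames of 4x4 RGB each.
seg_a = np.zeros((8, 4, 4, 3), dtype=np.float32)
seg_b = np.ones((8, 4, 4, 3), dtype=np.float32)
video = splice_segments([seg_a, seg_b], num_transition_frames=4)
print(video.shape)  # (20, 4, 4, 3)
```

A cross-fade of this kind blends pixels rather than motion, which is why the abstract treats generating natural transitions as the core challenge of the splicing approach.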
Taxonomy
Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Video Coding and Compression Technologies
