Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models

Zhenhao Li; Shaohan Yi; Zheng Liu; Leonartinus Gao; Minh Ngoc Le; Ambrose Ling; Zhuoran Wang; Md Amirul Islam; Zhixiang Chi; Yuanhao Yu

arXiv:2512.20000 · cs.CV · January 1, 2026

Open Access

TL;DR

This paper introduces MIVA, a lightweight modular adapter for diffusion models that enables efficient, precise, and flexible image-to-video animation with minimal training data and without extensive prompt engineering.

Contribution

The paper presents MIVA, a novel modular adapter that lets pre-trained diffusion models animate images with improved motion control while training on limited data.

Findings

01. MIVA can be trained on about ten samples using a single GPU.

02. It lets users specify motion patterns without prompt engineering.

03. MIVA matches or exceeds the quality of models trained on larger datasets.

Abstract

Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute to this: the high dimensionality of video signals leads to a scarcity of training data, causing DMs to favor memorization over prompt compliance when generating motion; moreover, DMs struggle to generalize to novel motion patterns not present in the training set, and fine-tuning them to learn such patterns, especially using limited training data, is still under-explored. To address these limitations, we propose Modular Image-to-Video Adapter (MIVA), a lightweight sub-network attachable to a pre-trained DM, each designed to capture a single motion pattern and scalable via parallelization. MIVAs can be efficiently trained on approximately ten samples…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · 3D Shape Modeling and Analysis