Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM

Peng Liu; Xiaoming Ren; Fengkai Liu; Qingsong Xie; Quanlong Zheng; Yanhao Zhang; Haonan Lu; Yujiu Yang

arXiv:2505.19901 · cs.CV · June 4, 2025

TL;DR

Dynamic-I2V is an image-to-video (I2V) generation framework built on multimodal large language models (MLLMs). It improves motion controllability and temporal coherence, targeting complex scenes that involve nuanced motion and intricate object-action relationships.

Contribution

The paper integrates MLLMs with a diffusion transformer (DiT) so that visual and textual conditions are encoded jointly, and it proposes DIVE, a new benchmark for evaluating dynamic quality in I2V generation.
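As a rough illustration of what conditioning a denoiser on jointly encoded image and text tokens can look like, the toy sketch below merges two token lists into one conditioning sequence and lets video latent tokens attend to it. Everything here is an assumption made for illustration: the function names `joint_condition_tokens` and `cross_attend` are hypothetical, the "MLLM" is a plain concatenation placeholder, and single-head cross-attention is only one plausible way to inject the condition. This page does not state how Dynamic-I2V actually does it.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def joint_condition_tokens(image_tokens, text_tokens):
    """Hypothetical stand-in for an MLLM encoder.

    A real MLLM would process both modalities through shared transformer
    layers. Here the two token lists are simply concatenated into one
    sequence, which is an assumption made to keep the sketch small.
    """
    return list(image_tokens) + list(text_tokens)


def cross_attend(latent_tokens, cond_tokens):
    """Single-head scaled dot-product cross-attention with a residual add.

    Each latent token (query) attends over the conditioning tokens, which
    serve as both keys and values. No learned projections are used.
    """
    dim = len(cond_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in latent_tokens:
        weights = softmax([dot(query, key) * scale for key in cond_tokens])
        context = [
            sum(w * value[i] for w, value in zip(weights, cond_tokens))
            for i in range(dim)
        ]
        updated.append([q + c for q, c in zip(query, context)])
    return updated


# Made-up 2-dimensional embeddings, purely illustrative.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.5, 0.5]]
video_latents = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]

cond = joint_condition_tokens(image_tokens, text_tokens)
updated = cross_attend(video_latents, cond)
```

The point of the sketch is only that a single conditioning sequence carries both modalities, so the denoiser does not need separate image and text pathways.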

Findings

01. Improves dynamic range by 42.5% over existing methods, as assessed on the DIVE benchmark.

02. Improves controllability by 7.9%.

03. Improves the quality of generated videos by 11.8%.

Abstract

Recent advancements in image-to-video (I2V) generation have shown promising performance in conventional scenarios. However, these methods still encounter significant challenges when dealing with complex scenes that require a deep understanding of nuanced motion and intricate object-action relationships. To address these challenges, we present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. By leveraging the advanced multimodal understanding capabilities of MLLMs, our model significantly improves motion controllability and temporal coherence in synthesized videos. The inherent multimodality of Dynamic-I2V further enables flexible support for diverse conditional inputs, extending its applicability to various downstream generation tasks. Through…

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques

Methods: Diffusion