Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing

Hao Yang; Zhiyu Tan; Jia Gong; Luozheng Qin; Hesen Chen; Xiaomeng Yang; Yuqing Sun; Yuetan Lin; Mengping Yang; Hao Li

arXiv:2602.08820 · cs.CV · March 16, 2026

Open Access

TL;DR

Omni-Video 2 connects pretrained multimodal large-language models (MLLMs) with video diffusion models to unify video generation and editing in one scalable, computationally efficient model, with particular gains on complex, compositional edits.

Contribution

The MLLM interprets user instructions as explicit target captions, and a lightweight adapter injects the resulting multimodal conditional tokens into a pretrained text-to-video diffusion model, enabling high-quality, parameter-efficient video generation and editing.

Findings

01 Achieves superior performance on FiVE and VBench benchmarks.

02 Supports high-quality text-to-video generation.

03 Excels in complex, compositional video editing tasks.

Abstract

We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated training…
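To make the adapter idea in the abstract more concrete, here is a minimal sketch of one way projected MLLM tokens could be combined with a text-conditioning sequence. It is not the authors' implementation: the function names, the single linear projection, the choice to concatenate along the sequence axis, and the assumption that only the adapter is trained while the backbone stays frozen are all hypothetical choices made for illustration.

```python
# Illustrative sketch only. Names, shapes, the linear projection, and the
# concatenation strategy are assumptions, not the paper's implementation.

def project_tokens(tokens, weight):
    """Linearly map each token (length d_in) to length d_out.

    `weight` is a d_in x d_out matrix stored as a list of rows.
    """
    d_in = len(weight)
    d_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for tok in tokens
    ]


def build_condition(text_tokens, mllm_tokens, adapter_weight):
    """Append adapter-projected MLLM tokens to the text-conditioning sequence.

    The result is one longer sequence that a (hypothetical) diffusion
    backbone could attend to through its existing conditioning pathway.
    """
    return text_tokens + project_tokens(mllm_tokens, adapter_weight)


def trainable_fraction(adapter_params, frozen_params):
    """Share of all parameters that are trained if only the adapter is updated."""
    return adapter_params / (adapter_params + frozen_params)
```

As a rough sense of scale, a hypothetical 50M-parameter adapter next to the 14B diffusion model mentioned in the abstract gives `trainable_fraction(50e6, 14e9)`, which is well under 1% of the total parameters. The 50M figure is invented for illustration; the abstract does not state the adapter size.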

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization