Omni-Video: Democratizing Unified Video Understanding and Generation

Zhiyu Tan; Hao Yang; Luozheng Qin; Jia Gong; Mengping Yang; Hao Li

arXiv:2507.06119 · cs.CV · March 16, 2026

Open Access · 1 Repo · 1 Model

TL;DR

Omni-Video introduces a unified framework that leverages multimodal large language models and diffusion decoders to advance video understanding, generation, and editing with high efficiency and broad applicability.

Contribution

The paper presents an architecture and training scheme that enable unified video modeling by combining existing multimodal large language models with diffusion decoders, addressing the gap left by foundation models that focus predominantly on images.

Findings

01 Effective video generation, editing, and understanding demonstrated

02 Lightweight design enables fast training with limited data

03 Model generalizes well across multiple video tasks

Abstract

Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that respectively attaches…
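The data flow described in the abstract (an MLLM emits continuous visual clues, and a diffusion decoder is conditioned on them) can be pictured with a toy sketch. Everything below is an illustrative assumption rather than the authors' implementation: the names `project`, `make_visual_clues`, `toy_denoise_step`, and `generate` are hypothetical, the adapter is reduced to one linear map, and the "denoising" loop is a simple stand-in that pulls a noisy latent toward the mean of the clues, not an actual diffusion sampler.

```python
import random

def project(hidden, weight):
    """Hypothetical adapter: linear map from MLLM hidden size to conditioning size."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

def make_visual_clues(mllm_hidden_states, weight):
    """Turn each MLLM hidden state into one continuous conditioning vector."""
    return [project(h, weight) for h in mllm_hidden_states]

def toy_denoise_step(latent, clues, step_size=0.5):
    """Stand-in for one decoder step: move the latent toward the mean of the clues."""
    dim = len(latent)
    target = [sum(c[i] for c in clues) / len(clues) for i in range(dim)]
    return [x + step_size * (t - x) for x, t in zip(latent, target)]

def generate(mllm_hidden_states, weight, steps=20, seed=0):
    """Start from Gaussian noise and refine it, conditioned on the visual clues."""
    rng = random.Random(seed)
    clues = make_visual_clues(mllm_hidden_states, weight)
    latent = [rng.gauss(0.0, 1.0) for _ in range(len(weight))]
    for _ in range(steps):
        latent = toy_denoise_step(latent, clues)
    return clues, latent

if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 1.0]]               # two made-up MLLM hidden states
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up 2 -> 3 adapter weights
    clues, latent = generate(hidden, weight)
    print(clues)
    print(latent)
```

The point of the sketch is only the separation of roles: the language model side produces conditioning vectors, and the decoder side consumes them while iteratively refining noise. In a real system each piece would be a trained network.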

Code & Models

Repositories

sais-fuxi/omni-video (Official)

Models

🤗 howellyoung1/OmniVideo11B · model · ♡ 3

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Adapter · Diffusion · Focus