Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Luozheng Qin; Jia Gong; Qian Qiao; Tianjiao Li; Li Xu; Haoyu Pan; Chao Qu; Zhiyu Tan; Hao Li

arXiv:2604.08121·cs.CV·April 10, 2026

TL;DR

Uni-ViGU is a unified framework for video generation and understanding built on a diffusion-based video generator; combined with a bidirectional training mechanism, it reports competitive performance on both tasks.

Contribution

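The paper extends a video generator, rather than an understanding-centric MLLM, into a unified model. It does so with a unified flow method (continuous flow matching for video, discrete flow matching for text) and a modality-driven MoE framework that adds lightweight text-generation layers to the Transformer blocks while preserving the generator's priors.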
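To make the modality-driven MoE idea concrete, here is a minimal sketch. It assumes "modality-driven" means each token is sent to an expert chosen by its modality tag, with no learned gate; the paper may route differently. The names `Token`, `modality_moe`, `video_expert`, and `text_expert` are hypothetical and are not taken from the paper or its repositories.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class Token:
    modality: str  # "video" or "text" (assumed tags)
    hidden: Vector


def modality_moe(
    tokens: List[Token],
    experts: Dict[str, Callable[[Vector], Vector]],
) -> List[Token]:
    """Send each token to the expert registered for its modality.

    Assumption: routing is a fixed lookup on the modality tag,
    not a learned gating network.
    """
    return [Token(tok.modality, experts[tok.modality](tok.hidden)) for tok in tokens]


# Stand-in experts. In the setting the abstract describes, the video expert
# would be the pretrained generator's layer and the text expert a lightweight
# added layer; here both are toy functions.
def video_expert(h: Vector) -> Vector:
    return [2.0 * v for v in h]


def text_expert(h: Vector) -> Vector:
    return [v + 1.0 for v in h]


if __name__ == "__main__":
    mixed = [Token("video", [1.0, 2.0]), Token("text", [1.0, 2.0])]
    for tok in modality_moe(mixed, {"video": video_expert, "text": text_expert}):
        print(tok.modality, tok.hidden)
```

Because video tokens only ever pass through the video expert in this sketch, adding or training the text expert leaves the video path untouched, which is one plausible reading of "preserving generative priors".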

Findings

1. Achieves competitive results in video generation and understanding tasks.
2. Introduces a bidirectional training mechanism for shared representations.
3. Demonstrates the scalability of generation-centric architectures for multimodal intelligence.

Abstract

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we…
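The unified flow method named in the abstract can be illustrated with a toy sketch. It assumes a linear interpolation path for the continuous (video) side, a masked-token path for the discrete (text) side, and a single shared time `t` for both; these are common flow matching choices and are assumptions here, not details confirmed by the abstract. `MASK`, `continuous_path`, `discrete_path`, and `sample_training_pair` are hypothetical names.

```python
import random

MASK = -1  # hypothetical mask token id for the discrete path


def continuous_path(x0, x1, t):
    """Linear path from noise x0 to data x1.

    Returns the point x_t on the path and the constant target velocity x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def discrete_path(tokens, t, rng):
    """Masked-token path: each token is kept with probability t, else masked."""
    return [tok if rng.random() < t else MASK for tok in tokens]


def sample_training_pair(video_latent, text_tokens, rng):
    """Corrupt both modalities with one shared time t (an assumption)."""
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in video_latent]
    video_t, velocity = continuous_path(noise, video_latent, t)
    text_t = discrete_path(text_tokens, t, rng)
    return {
        "t": t,
        "video_t": video_t,          # model input, continuous side
        "video_target": velocity,    # regression target for the velocity
        "text_t": text_t,            # model input, discrete side
        "text_target": list(text_tokens),  # clean tokens to recover
    }


if __name__ == "__main__":
    rng = random.Random(0)
    pair = sample_training_pair([0.1, 0.2, 0.3], [11, 12, 13, 14], rng)
    print(pair["t"], pair["text_t"])
```

In a sketch like this, a model would be trained to regress `video_target` from `video_t` and to predict `text_target` at the masked positions of `text_t`, both conditioned on the same `t`.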
