TL;DR
Uni-ViGU unifies video generation and understanding by building on a diffusion-based video generator rather than an understanding-centric MLLM, and uses a bidirectional training mechanism to reach competitive performance on both tasks.
Contribution
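Uni-ViGU extends a video generator in two ways: a unified flow method that applies continuous flow matching to video and discrete flow matching to text within a single process, and a modality-driven MoE framework that adds lightweight text-generation layers to the Transformer blocks while preserving the generator's priors. Together these support joint video understanding and generation.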
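As a rough illustration of the modality-driven MoE idea, the sketch below routes each token to an expert based only on its modality. Here "modality-driven" is read as hard routing by token type, which is an assumption; the function names `video_ffn`, `text_expert`, and `modality_moe` are hypothetical, and the toy arithmetic stands in for real feed-forward layers.

```python
def video_ffn(vec):
    # Stand-in for the pretrained generator's feed-forward layer
    # (assumed to stay unchanged so generative priors are preserved).
    return [2.0 * v for v in vec]


def text_expert(vec):
    # Stand-in for a lightweight layer added for text generation.
    return [v + 1.0 for v in vec]


def modality_moe(tokens):
    """Route each (modality, vector) pair to the expert for its modality.

    Routing is deterministic: no learned gate, only the modality tag.
    """
    outputs = []
    for modality, vec in tokens:
        if modality == "video":
            outputs.append(video_ffn(vec))
        elif modality == "text":
            outputs.append(text_expert(vec))
        else:
            raise ValueError(f"unknown modality: {modality}")
    return outputs


mixed = [("video", [1.0, 2.0]), ("text", [1.0, 2.0])]
routed = modality_moe(mixed)
```

Because video tokens only ever pass through the original layer in this sketch, adding the text expert cannot change the video path, which is one simple way to read "preserving generative priors".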
Findings
Achieves competitive results in video generation and understanding tasks.
Introduces a bidirectional training mechanism for shared representations.
Demonstrates the scalability of generation-centric architectures for multimodal intelligence.
Abstract
Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we…
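The unified flow idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's formulation: it assumes a linear interpolation path for the continuous (video) branch, a masking-style corruption for the discrete (text) branch, and a single time value shared by both. The names `continuous_flow_pair`, `discrete_flow_corrupt`, `unified_sample`, and the `MASK` token are hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask token for the discrete (text) branch


def continuous_flow_pair(x0, x1, t):
    """Continuous flow matching on a straight path from noise x0 to data x1.

    Returns the interpolated point x_t and the target velocity x1 - x0
    that a model would be trained to predict.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def discrete_flow_corrupt(tokens, t, rng):
    """Masking-style discrete corruption (an assumed choice).

    Each token is kept with probability t and replaced by MASK otherwise,
    so t = 0 is fully masked and t = 1 is the clean text.
    """
    return [tok if rng.random() < t else MASK for tok in tokens]


def unified_sample(video_latent, text_tokens, rng):
    """Build one training example with a shared time t for both modalities."""
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in video_latent]
    x_t, velocity = continuous_flow_pair(noise, video_latent, t)
    y_t = discrete_flow_corrupt(text_tokens, t, rng)
    return t, x_t, velocity, y_t


rng = random.Random(0)
t, x_t, velocity, y_t = unified_sample([0.3, -1.2, 0.8], ["a", "cat", "runs"], rng)
```

In training, a model would see `x_t` and `y_t` together and be asked to predict `velocity` for the video part and the original tokens at the masked positions for the text part.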
