UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Houyuan Chen; Hong Li; Xianghao Kong; Tianrui Zhu; Shaocong Xu; Weiqing Xiao; Yuwei Guo; Chongjie Ye; Lvmin Zhang; Hao Zhao; Anyi Rao

arXiv:2605.00658·cs.CV·May 4, 2026

TL;DR

UniVidX is a unified multimodal framework that repurposes video diffusion model (VDM) priors so that a single model can handle versatile, cross-modal video generation across pixel-aligned tasks and modalities.

Contribution

UniVidX introduces a unified framework built on three key designs, Stochastic Condition Masking (SCM), Decoupled Gated LoRA (DGL), and CMSA, for flexible cross-modal video generation, and it reports outperforming prior task-specific models.

Findings

01

Achieves state-of-the-art performance on multiple video generation tasks.

02

Generalizes well to in-the-wild scenarios with limited training data.

03

Supports diverse modalities including RGB videos, intrinsic maps, and RGBA layers.

Abstract

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA…
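To make the SCM idea in the abstract concrete, here is a minimal sketch of how a random partition into clean conditions and noisy targets might look in a training step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the modality names (`rgb`, `albedo`, `normal`), the rule that at least one modality must be a target, and the linear noising rule are all hypothetical.

```python
import random


def sample_condition_mask(modalities, rng):
    """Randomly split modalities into clean conditions and noisy targets.

    Assumption (not from the paper): at least one modality is always a
    target, so there is something to denoise. An empty condition set is
    allowed and would correspond to generating every modality jointly.
    """
    names = list(modalities)
    # Choose how many modalities become targets: between 1 and all of them.
    num_targets = rng.randint(1, len(names))
    target_set = set(rng.sample(names, num_targets))
    conditions = [m for m in names if m not in target_set]
    targets = [m for m in names if m in target_set]
    return conditions, targets


def build_training_input(latents, t, rng):
    """Noise only the target modalities and keep the conditions clean.

    latents: dict mapping modality name -> list of floats (a stand-in for
             real video latents).
    t:       noise level in [0, 1]. The interpolation (1 - t) * x + t * eps
             is an assumed noising rule chosen for illustration.
    """
    conditions, targets = sample_condition_mask(latents.keys(), rng)
    model_input = {}
    for name, x in latents.items():
        if name in targets:
            model_input[name] = [(1 - t) * v + t * rng.gauss(0.0, 1.0) for v in x]
        else:
            model_input[name] = list(x)  # clean condition, passed through unchanged
    return model_input, conditions, targets


if __name__ == "__main__":
    rng = random.Random(0)
    latents = {"rgb": [1.0, 2.0], "albedo": [0.5, 0.5], "normal": [0.0, 1.0]}
    model_input, conditions, targets = build_training_input(latents, 0.5, rng)
    print("conditions:", conditions)
    print("targets:", targets)
```

Because the partition is resampled on every call, the same set of modalities yields many different condition-to-target mappings over the course of training, which is the property the abstract contrasts with fixed input-output mappings.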

Code & Models

Repositories

Models

🤗 houyuanchen/UniVidX (model, ♡ 26)
