TL;DR
UniVidX is a unified multimodal video generation framework built on video diffusion model (VDM) priors. A single model handles multiple pixel-aligned tasks and modalities instead of requiring one model per task.
Contribution
It introduces a unified framework built on three designs, Stochastic Condition Masking (SCM), Decoupled Gated LoRA (DGL), and CMSA, for flexible cross-modal video generation, and it outperforms prior task-specific models.
Findings
Achieves state-of-the-art performance on multiple video generation tasks.
Generalizes well to in-the-wild scenarios with limited training data.
Supports diverse modalities including RGB videos, intrinsic maps, and RGBA layers.
Abstract
Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA…
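The SCM idea described in the abstract (randomly splitting modalities into clean conditions and noisy targets at each training step) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the modality names, the 0.5 split probability, the rule that at least one target is always kept, and the linear noising rule are all assumptions made for the example.

```python
import random


def stochastic_condition_mask(modalities, rng):
    """Randomly split modality names into clean conditions and noisy targets.

    Assumption (not from the paper): each modality becomes a condition with
    probability 0.5, and at least one modality is always kept as a target.
    """
    conditions = [m for m in modalities if rng.random() < 0.5]
    targets = [m for m in modalities if m not in conditions]
    if not targets:
        # Move one random modality back so there is something to generate.
        moved = rng.choice(conditions)
        conditions.remove(moved)
        targets.append(moved)
    return conditions, targets


def build_training_input(latents, noise, t, conditions):
    """Keep condition latents clean; noise the targets at level t in [0, 1].

    The linear interpolation below is an illustrative noising rule only,
    not necessarily the schedule used by the paper's backbone.
    """
    mixed = {}
    for name, x in latents.items():
        if name in conditions:
            mixed[name] = list(x)  # clean condition, passed through as-is
        else:
            mixed[name] = [(1 - t) * xi + t * ni for xi, ni in zip(x, noise[name])]
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical modality names chosen for illustration.
    modalities = ["rgb", "albedo", "normal"]
    conditions, targets = stochastic_condition_mask(modalities, rng)
    print("conditions:", conditions, "targets:", targets)
```

Because the split is resampled every step, the same network sees every modality both as a condition and as a target over the course of training, which is what allows conditioning in any direction at inference time rather than a single fixed input-output mapping.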
