DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Yujie Wei; Xinyu Liu; Shiwei Zhang; Hangjie Yuan; Jinbo Xing; Zhekai Chen; Xiang Wang; Haonan Qiu; Rui Zhao; Yutong Feng; Ruihang Chu; Yingya Zhang; Yike Guo; Xihui Liu; Hongming Shan

arXiv:2603.12257 · cs.CV · March 13, 2026

TL;DR

DreamVideo-Omni is a unified framework for multi-subject video customization with omni-motion control. It is trained in two progressive stages: the first jointly trains on control signals for subject appearance, global motion, local dynamics, and camera movement, and latent identity reinforcement learning targets identity preservation.

Contribution

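The paper presents a unified framework trained with a progressive two-stage paradigm. It introduces a condition-aware 3D rotary positional embedding to coordinate heterogeneous control inputs and uses latent identity reinforcement learning to improve multi-subject video synthesis.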

Findings

01 Superior identity preservation in multi-subject videos

02 Enhanced motion control accuracy and granularity

03 Effective disentanglement of complex multi-subject scenes

Abstract

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation