DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Xu Guo; Fulong Ye; Qichao Sun; Liyang Chen; Bingchuan Li; Pengze Zhang; Jiawei Liu; Songtao Zhao; Qian He; Xiangwang Hou

arXiv:2602.12160·cs.CV·February 13, 2026

TL;DR

DreamID-Omni is a unified framework for controllable human-centric audio-video generation. It handles multiple character identities and voice timbres through disentanglement and progressive training strategies, and reports state-of-the-art results.

Contribution

The paper presents a Symmetric Conditional Diffusion Transformer, a Dual-Level Disentanglement strategy, and multi-task progressive training for controllable human-centric audio-video generation.

Findings

1. Achieves state-of-the-art performance across multiple audio-visual tasks.
2. Effectively disentangles identities and voice timbres in multi-person scenarios.
3. Outperforms leading commercial models in quality and consistency.

Abstract

Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy:…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications