UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Guozhen Zhang; Zixiang Zhou; Teng Hu; Ziqiao Peng; Youliang Zhang; Yi Chen; Yuan Zhou; Qinglin Lu; Limin Wang

arXiv:2511.03334·cs.CV·March 25, 2026

TL;DR

UniAVGen is a unified framework for joint audio and video generation that improves synchronization and semantic consistency through asymmetric cross-modal interactions and dedicated guidance strategies.

Contribution

The paper presents a dual-branch diffusion transformer architecture with asymmetric cross-modal interaction and face-aware modulation for improved audio-video synthesis.

Findings

01. Achieves better synchronization and semantic consistency with fewer training samples.

02. Enables multiple audio-video tasks within a single unified model.

03. Outperforms existing methods in key generative metrics.

Abstract

Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we…
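The abstract describes bidirectional, temporally aligned cross-attention between the audio and video branches but does not say how the alignment is computed. The following is a minimal sketch of one way such an interaction could look, not the authors' implementation. Everything in it is an assumption made for illustration: the helper names (`aligned_window`, `aligned_cross_attention`), the proportional windowing rule, the use of a single token per video frame, and the omission of learned projections and Face-Aware Modulation.

```python
import math


def aligned_window(frame_idx, num_frames, num_audio_tokens, context=0):
    """Audio token indices that overlap video frame `frame_idx` in time.

    Assumption: audio tokens are spread uniformly over the clip, so each
    frame owns a proportional slice, optionally widened by `context` tokens.
    """
    per_frame = num_audio_tokens / num_frames
    start = max(0, math.floor(frame_idx * per_frame) - context)
    end = min(num_audio_tokens, math.ceil((frame_idx + 1) * per_frame) + context)
    return range(start, end)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aligned_cross_attention(queries, keys, values, windows):
    """Scaled dot-product attention where query i only sees keys in windows[i]."""
    dim = len(queries[0])
    value_dim = len(values[0])
    outputs = []
    for query, window in zip(queries, windows):
        idx = list(window)
        weights = softmax([dot(query, keys[j]) / math.sqrt(dim) for j in idx])
        outputs.append(
            [sum(w * values[j][c] for w, j in zip(weights, idx)) for c in range(value_dim)]
        )
    return outputs


# Toy clip: 2 video frames (one token each) and 4 audio tokens.
num_frames, num_audio = 2, 4
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Audio-to-video: each frame attends to the audio tokens in its time slice.
a2v_windows = [aligned_window(t, num_frames, num_audio) for t in range(num_frames)]
video_update = aligned_cross_attention(video, audio, audio, a2v_windows)

# Video-to-audio: each audio token attends only to the frame it falls in.
v2a_windows = [[min(num_frames - 1, j * num_frames // num_audio)] for j in range(num_audio)]
audio_update = aligned_cross_attention(audio, video, video, v2a_windows)
```

The two directions use different window shapes (many audio tokens per frame, one frame per audio token), which is one plausible reading of "asymmetric". In a real dual-branch DiT the updates would be added back to each branch's hidden states through learned projections.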

Code & Models

Models

Hugging Face: MCG-NJU/UniAVGen (model, 97 downloads, 5 likes)

Taxonomy

Topics: Music Technology and Sound Studies · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis