UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang

TL;DR
UniAVGen introduces a unified framework for joint audio and video generation that enhances synchronization and semantic consistency using asymmetric cross-modal interactions and innovative guidance strategies.
Contribution
It presents a novel dual-branch diffusion transformer architecture with asymmetric cross-modal interaction and face-aware modulation for improved audio-video synthesis.
Findings
Achieves better synchronization and semantic consistency with fewer training samples.
Enables multiple audio-video tasks within a single unified model.
Outperforms existing methods in key generative metrics.
Abstract
Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic Technology and Sound Studies · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
