AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

Le Wang; Jun Wang; Chunyu Qiang; Feng Deng; Chen Zhang; Di Zhang; Kun Gai

arXiv:2508.00733·cs.SD·August 8, 2025

TL;DR

AudioGen-Omni introduces a unified multimodal diffusion transformer that generates high-quality, synchronized audio, speech, and song from video inputs, advancing cross-modal alignment and efficiency in audio generation tasks.

Contribution

It presents a novel joint training paradigm and a unified encoder for multimodal audio-visual generation, achieving state-of-the-art results across multiple audio synthesis tasks.

Findings

01

State-of-the-art performance on Text-to-Audio/Speech/Song tasks

02

Enhanced audio quality and semantic alignment

03

Efficient inference: 1.91 seconds to generate 8 seconds of audio
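The inference figure above can be restated as a real-time factor (generation time divided by audio duration). The snippet below is only the arithmetic on the two numbers quoted on this page; the function name is illustrative, and the hardware, batch size, and sampler settings behind the 1.91-second figure are not stated here.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time per second of audio produced (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Figures quoted in the findings: 1.91 s of compute for 8 s of audio.
rtf = real_time_factor(1.91, 8.0)
speedup = 1.0 / rtf  # how many times faster than real time

print(f"real-time factor: {rtf:.3f}")          # real-time factor: 0.239
print(f"faster than real time: {speedup:.2f}x")  # faster than real time: 4.19x
```

A real-time factor below 1 means audio is produced faster than it plays back; comparing it across systems is only meaningful when the hardware is the same.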

Abstract

We present AudioGen-Omni, a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high-fidelity audio, speech, and song coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, yielding a model that generates semantically rich, acoustically diverse audio conditioned on multimodal inputs and adapts to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both song and spoken inputs into dense frame-level representations. These representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured…
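To make the idea of selectively applied RoPE concrete, here is a minimal sketch of standard rotary position embedding applied only to streams flagged as temporally structured. It is not the paper's PAAPI implementation: the abstract is truncated on this page, so the stream names (`video_frames`, `text_caption`), the choice of which streams count as temporal, and both function names are assumptions made purely for illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Standard RoPE: rotate each consecutive pair of features by a
    position-dependent angle. `vec` must have an even length."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # per-pair rotation frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def apply_selective_rope(streams):
    """Rotate tokens only in streams flagged as temporal.

    `streams` maps a (hypothetical) stream name to (tokens, is_temporal).
    Non-temporal streams are copied through without positional rotation.
    """
    result = {}
    for name, (tokens, is_temporal) in streams.items():
        if is_temporal:
            result[name] = [rope_rotate(t, pos) for pos, t in enumerate(tokens)]
        else:
            result[name] = [list(t) for t in tokens]
    return result


if __name__ == "__main__":
    # Hypothetical streams; which ones are temporal is an assumption here.
    demo = {
        "video_frames": ([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]], True),
        "text_caption": ([[1.0, 0.0]], False),
    }
    for name, tokens in apply_selective_rope(demo).items():
        print(name, tokens)
```

Because each rotation depends only on the token's position, the dot product between two rotated vectors depends on their relative offset, which is what lets attention stay sensitive to timing in the rotated streams while leaving the unrotated ones position-free.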

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis