Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Zeyue Tian; Binxin Yang; Zhaoyang Liu; Jiexuan Zhang; Ruibin Yuan; Hubery Yin; Qifeng Chen; Chen Li; Jing Lyu; Wei Xue; Yike Guo

arXiv:2604.10708 · cs.SD · April 28, 2026

TL;DR

Audio-Omni is a framework that unifies audio understanding, generation, and editing across general sound, music, and speech. It pairs multimodal reasoning with high-fidelity synthesis and is supported by AudioEdit, a large-scale dataset built to address the scarcity of audio-editing data.

Contribution

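Audio-Omni is presented as the first end-to-end framework to unify generation and editing across sound, music, and speech with integrated multimodal understanding. It pairs a frozen multimodal LLM for high-level reasoning with a trainable diffusion transformer for synthesis.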

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Outperforms prior unified approaches and matches specialized models.

03. Demonstrates advanced capabilities such as zero-shot cross-lingual control.

Abstract

Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new…

Code & Models

Models

HKUSTAudio/Audio-Omni (Hugging Face model)
