TL;DR
Audio-Omni is an end-to-end framework that unifies audio understanding, generation, and editing across general sound, music, and speech, supported by AudioEdit, a large-scale dataset built to address the scarcity of audio-editing data.
Contribution
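The paper introduces what the authors describe as the first end-to-end unified model for versatile audio tasks across sound, music, and speech, combining a frozen multimodal LLM for high-level reasoning with a trainable diffusion transformer for high-fidelity synthesis.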
Findings
Achieves state-of-the-art results on multiple benchmarks.
Outperforms prior unified approaches and matches specialized models.
Demonstrates advanced capabilities like zero-shot cross-lingual control.
Abstract
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new…
