AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation

Yan Rong; Jinting Wang; Guangzhi Lei; Shan Yang; Li Liu

arXiv:2505.22053·cs.SD·August 6, 2025

TL;DR

AudioGenie is a training-free multi-agent framework for multimodality-to-multiaudio (MM2MA) generation. It targets better understanding of multimodal inputs and more diverse, more reliable synthesized audio, and is accompanied by a new benchmark dataset.

Contribution

The paper proposes a novel multi-agent system for MM2MA with a dual-layer architecture and self-correction, along with the first benchmark dataset for this task.

Findings

01. Achieves state-of-the-art performance across 8 tasks.

02. Demonstrates improved audio quality, accuracy, and alignment.

03. User studies confirm effectiveness in aesthetics and reliability.

Abstract

Multimodality-to-Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high-quality paired datasets and the lack of robust multi-task learning frameworks. Recently, multi-agent systems have shown great potential in tackling these issues. However, directly applying them to the MM2MA task presents three critical challenges: (1) inadequate fine-grained understanding of multimodal inputs (especially video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self-correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training-free multi-agent system featuring a dual-layer architecture with a generation team and a supervisor team. For the…
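
The abstract is cut off before it describes how the two teams interact, so the following is only a minimal sketch of the general generate-then-review pattern that a "generation team plus supervisor team with self-correction" design suggests. It is not AudioGenie's actual implementation. Every name here (AudioEvent, Draft, run_with_self_correction, toy_generate, toy_review, max_rounds) is hypothetical, and the generator and reviewer are toy stand-ins for real audio models and critics.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AudioEvent:
    kind: str         # e.g. "speech", "music", "sound_effect", "song"
    description: str  # what should be heard


@dataclass
class Draft:
    event: AudioEvent
    payload: str      # stand-in for generated audio (hypothetical)
    attempt: int      # 1-based attempt number


# Hypothetical interfaces: a generator takes optional feedback from the
# previous round; a reviewer returns feedback text, or None to accept.
Generator = Callable[[AudioEvent, int, Optional[str]], Draft]
Reviewer = Callable[[Draft], Optional[str]]


def run_with_self_correction(
    events: List[AudioEvent],
    generate: Generator,
    review: Reviewer,
    max_rounds: int = 3,
) -> List[Draft]:
    """Generate one draft per event, retrying while the reviewer objects."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    results: List[Draft] = []
    for event in events:
        feedback: Optional[str] = None
        draft: Optional[Draft] = None
        for attempt in range(1, max_rounds + 1):
            draft = generate(event, attempt, feedback)
            feedback = review(draft)
            if feedback is None:  # reviewer accepted the draft
                break
        # If every round was rejected, the last draft is kept as-is.
        results.append(draft)
    return results


def toy_generate(event: AudioEvent, attempt: int, feedback: Optional[str]) -> Draft:
    # Toy generator: marks the payload as revised when feedback was given.
    suffix = " [revised]" if feedback else ""
    return Draft(event, f"{event.kind}:{event.description}{suffix}", attempt)


def toy_review(draft: Draft) -> Optional[str]:
    # Toy reviewer: rejects the first attempt at any speech event.
    if draft.event.kind == "speech" and draft.attempt == 1:
        return "speech is unintelligible"
    return None
```

Calling run_with_self_correction with a speech event and a music event, using the toy functions above, retries the speech event once and accepts the music event on the first attempt. The max_rounds cap is an assumption of this sketch that keeps the loop from running forever when a reviewer never accepts.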

Code & Models

Datasets

ryysayhi/MA-Bench
dataset · 53 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing