MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

Chih-Kai Yang; Yun-Shao Tsai; Yu-Kai Guo; Ping-Le Tsai; Yen-Ting Piao; Hung-Wei Chen; Ting-Lin Hsiao; Yun-Man Hsu; Ke-Han Lu; Hung-yi Lee

arXiv:2603.09714 · cs.SD · March 11, 2026

Open Access

TL;DR

This paper introduces MUGEN, a benchmark for multi-audio understanding in large audio-language models, revealing their weaknesses and proposing training-free strategies to enhance robustness and accuracy.

Contribution

The paper presents MUGEN, the first comprehensive benchmark for multi-audio understanding, and demonstrates effective training-free methods to improve model performance and robustness.

Findings

01. Models' performance degrades with more concurrent audio inputs.

02. Audio-Permutational Self-Consistency improves accuracy by up to 6.28%.

03. Combining permutation with Chain-of-Thought further enhances results.

Abstract

While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought raises the gain further, to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
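The permutation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the task is to pick one clip out of several candidates, that `answer_fn` is a hypothetical wrapper around an LALM which returns the position of the clip it selects, and that predictions are aggregated by simple majority vote; the paper may aggregate differently. All names here are illustrative.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def permutational_self_consistency(
    audios: Sequence[str],
    question: str,
    answer_fn: Callable[[Sequence[str], str], int],
    num_permutations: int = 5,
    seed: int = 0,
) -> int:
    """Query the model on shuffled candidate orders and majority-vote.

    `answer_fn` is a hypothetical stand-in for an LALM call: it receives
    the candidates in presentation order and returns the position
    (0-based) of the clip it picks. The return value of this function is
    an index into the ORIGINAL `audios` sequence.
    """
    if num_permutations < 1:
        raise ValueError("num_permutations must be at least 1")

    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(num_permutations):
        # order[p] is the original index of the clip shown at position p.
        order = list(range(len(audios)))
        rng.shuffle(order)
        shuffled = [audios[i] for i in order]

        position = answer_fn(shuffled, question)
        # Map the chosen position back to the original candidate index.
        votes[order[position]] += 1

    # Most frequently chosen original index (assumed majority-vote rule).
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy position-biased "model": it finds the target unless the target
    # is presented last, in which case it falls back to position 0.
    def biased_model(clips: Sequence[str], _question: str) -> int:
        target_pos = list(clips).index("dog_bark.wav")
        return 0 if target_pos == len(clips) - 1 else target_pos

    clips = ["rain.wav", "dog_bark.wav", "piano.wav"]
    choice = permutational_self_consistency(
        clips, "Which clip contains a dog?", biased_model, num_permutations=9
    )
    print("chosen clip:", clips[choice])
```

Mapping each vote back to the original index is the step that lets predictions from different orderings be pooled; without it, a position-biased model would simply vote for the same slot every time.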

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis