MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha

TL;DR
MMAU is a comprehensive benchmark for evaluating multimodal audio understanding models on complex tasks involving speech, sounds, and music, emphasizing reasoning and domain-specific knowledge.
Contribution
Introduces MMAU, a new benchmark with 10k audio clips and complex reasoning questions to challenge and advance audio understanding models.
Findings
Even the most advanced models evaluated achieve only around 53% accuracy.
MMAU reveals significant gaps in current audio understanding capabilities.
The benchmark encourages development of more sophisticated multimodal audio models.
Abstract
The ability to comprehend audio--which includes speech, non-speech sounds, and music--is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the…
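As a rough illustration of how an accuracy figure like the roughly 53% reported under Findings could be tallied for a multiple-choice benchmark spanning speech, sound, and music, the sketch below computes per-domain and overall accuracy. The record fields (`domain`, `answer`, `prediction`), the function name `accuracy_by_domain`, and the toy records are assumptions made for illustration; they are not MMAU's actual data schema or official evaluation script.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return {domain: accuracy}, plus an overall entry under the key 'total'.

    Each record is a dict with hypothetical keys 'domain', 'answer', and
    'prediction'. Matching is exact after trimming whitespace and lowercasing.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        hit = r["prediction"].strip().lower() == r["answer"].strip().lower()
        # Count the record once for its own domain and once for the overall score.
        for key in (r["domain"], "total"):
            total[key] += 1
            correct[key] += int(hit)
    return {k: correct[k] / total[k] for k in total}


# Toy records invented for illustration only (not MMAU data).
records = [
    {"domain": "speech", "answer": "A", "prediction": "A"},
    {"domain": "speech", "answer": "B", "prediction": "C"},
    {"domain": "sound", "answer": "D", "prediction": "d "},
    {"domain": "music", "answer": "B", "prediction": "A"},
]
scores = accuracy_by_domain(records)
```

Exact matching on the chosen option is the simplest scoring rule; models that answer in free text would need an extra step to map their output to an option before scoring.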
Peer Reviews
Decision: ICLR 2025 Spotlight
1. The topic of this paper is very important — currently, most audio LLMs are benchmarked using different evals, and researchers must test other models for comparison, which is tedious and impractical for many. A unified benchmark is common in text and vision, such as MMLU, MMMU, and DocVQA, and has already demonstrated its effectiveness as a good measure of models. We need one for audio. 2. The authors also benchmarked 18 LLMs, which is a non-trivial task as some open-source repositories are
1. I would like to see more discussion for an ICLR submission. Although the authors did significant work, I encourage them to add more insights to the results, such as exploring why one model performs better than another in specific benchmarks. - e.g., is the difference of a specific test more likely from the training data (e.g., Google's model trained on more data than those from academia), or from the architecture/training (e.g., early fusion, late fusion, discrete token vs continuous embeddin
- In this paper, a large-scale audio understanding benchmark, "MMAU", is built to evaluate LALMs. Unlike previous benchmarks, MMAU not only pays more attention to deeper and more difficult auditory reasoning tasks, but also covers a wide range of sound signals. - I believe MMAU is a reliable benchmark, as it is carefully designed during the build process with human review at each step. - Most existing LALMs are evaluated on MMAU in this paper. Besides, the authors have analysed the
- It seems that MMAU primarily focuses on short audio (around 10 seconds) and lacks evaluations involving perception and reasoning for long audio. Since long audio includes more contextual information, the model's ability to understand long audio might offer a broader indication of its overall performance. - MMAU is still predominantly multiple choice, but existing LALMs may not be good at multiple choice. Perhaps open-ended questions could be used as prompts instead when testing.
- MMAU significantly improves upon existing benchmarks by covering 27 distinct skills across three domains (speech, sound, and music). The low accuracy scores of state-of-the-art models highlight the benchmark's difficulty and push for more advanced models to handle complex tasks. - It is the first benchmark for reasoning and expert-level knowledge extraction in literature, setting it apart from previous benchmarks focused on foundational audio processing tasks. This aligns with the growing dema
- MMAU focuses solely on multiple-choice tasks, which could skew results towards models trained for MCQ-type question-answering and possibly even contrastive models. It would be beneficial to include an open-ended subset, even a small one, to contrast performance with the close-ended tasks. - The current version treats skills needed for information extraction and reasoning as separate, potentially oversimplifying the evaluation of tasks requiring a combination of skills. - [Minor] The contrastiv
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
