MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu

TL;DR
MMAR is a comprehensive benchmark designed to evaluate deep reasoning in audio-language models across diverse real-world audio tasks, emphasizing multi-step reasoning and domain-specific knowledge.
Contribution
This paper introduces MMAR, a large, multi-disciplinary audio reasoning benchmark with hierarchical questions and Chain-of-Thought annotations, expanding evaluation beyond existing domain-specific datasets.
Findings
Current models struggle with MMAR's complex reasoning tasks.
MMAR reveals significant limitations in existing audio reasoning capabilities.
The benchmark encourages the development of more advanced multi-step reasoning models.
Abstract
We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a…
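The item structure described in the abstract (an audio-question-answer triplet tagged with one of four reasoning layers and a sub-category) can be sketched roughly as follows. This is only an illustration: the field names, the `BenchmarkItem` class, the `accuracy_by_layer` helper, and the case-insensitive exact-match scoring are all assumptions made for this example, not MMAR's released data format or official evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ReasoningLayer(Enum):
    """The four reasoning layers named in the abstract."""
    SIGNAL = "Signal"
    PERCEPTION = "Perception"
    SEMANTIC = "Semantic"
    CULTURAL = "Cultural"


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical record for one audio-question-answer triplet."""
    audio_path: str        # assumed: path to the audio clip
    question: str
    answer: str
    layer: ReasoningLayer
    sub_category: str      # assumed: free-text sub-category label


def accuracy_by_layer(items, predictions):
    """Return per-layer accuracy using case-insensitive exact match (an assumption)."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.layer] += 1
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.layer] += 1
    # Only layers that actually appear in `items` are reported.
    return {layer: correct[layer] / total[layer] for layer in total}
```

Reporting accuracy per layer rather than as one aggregate number makes it easier to see whether a model fails at low-level signal analysis or at higher-level semantic and cultural reasoning.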
Code & Models
- 🤗 nvidia/music-flamingo-2601-hf (model) · 9.5k downloads · ♡ 89
- 🤗 nvidia/audio-flamingo-3 (model) · 340 downloads · ♡ 145
- 🤗 nvidia/audio-flamingo-3-hf (model) · 165k downloads · ♡ 176
- 🤗 nvidia/music-flamingo-think-2601-hf (model) · 912 downloads · ♡ 33
- 🤗 nvidia/audio-flamingo-3-chat (model) · 210 downloads · ♡ 48
- 🤗 nvidia/music-flamingo-hf (model) · 5.2k downloads · ♡ 86
- 🤗 henry1477/music-flamingo-gguf (model) · 456 downloads · ♡ 3
- 🤗 henry1477/music-flamingo-2601-hf-fp8 (model) · 173 downloads · ♡ 1
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training
