MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Guoli Yin, Haoping Bai, Shuang Ma, Feng Nan, Yanchao Sun, Zhaoyang Xu, Shen Ma, Jiarui Lu, Xiang Kong, Aonan Zhang, Dian Ang Yap, Yizhe Zhang, Karsten Ahnert, Vik Kamath, Mathias Berglund, Dominic Walsh, Tobias Gindele, Juergen Wiest, Zhengfeng Lai, Xiaoming Wang

TL;DR
MMAU is a comprehensive benchmark designed to evaluate large language models across diverse domains and skills, providing detailed insights into their capabilities and limitations without complex environment setups.
Contribution
This paper introduces MMAU, a holistic offline benchmark with 20 tasks across five domains, enabling detailed assessment of LLM agent skills and performance.
Findings
Evaluated 18 models, revealing varied strengths and weaknesses.
Provided insights into model capabilities across multiple domains.
Enhanced interpretability of LLM performance.
Abstract
Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Semantic Web and Ontologies
