MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Guoli Yin; Haoping Bai; Shuang Ma; Feng Nan; Yanchao Sun; Zhaoyang Xu; Shen Ma; Jiarui Lu; Xiang Kong; Aonan Zhang; Dian Ang Yap; Yizhe Zhang; Karsten Ahnert; Vik Kamath; Mathias Berglund; Dominic Walsh; Tobias Gindele; Juergen Wiest; Zhengfeng Lai; Xiaoming Wang; Jiulong Shan; Meng Cao; Ruoming Pang; Zirui Wang

arXiv:2407.18961 · cs.AI · August 19, 2024 · 1 citation

TL;DR

MMAU is a comprehensive benchmark designed to evaluate large language models across diverse domains and skills, providing detailed insights into their capabilities and limitations without complex environment setups.

Contribution

This paper introduces MMAU, a holistic offline benchmark with 20 tasks across five domains, enabling detailed assessment of LLM agent skills and performance.

Findings

01. Evaluated 18 models, revealing varied strengths and weaknesses.

02. Provided insights into model capabilities across multiple domains.

03. Enhanced interpretability of LLM performance.

Abstract

Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data…

Code & Models

Repositories

apple/axlearn (JAX, official)

Datasets

apple/mmau (dataset, 354 downloads)

Taxonomy

Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Semantic Web and Ontologies

Methods: Focus