BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

Wei Wang; Dan Zhang; Tao Feng; Boyan Wang; Jie Tang

arXiv:2408.15971 · cs.CL · August 29, 2024

Open Access

TL;DR

BattleAgentBench is a benchmark for evaluating how well language models cooperate and compete in multi-agent scenarios. Its results reveal strengths and gaps across model types and difficulty levels.

Contribution

The paper introduces a fine-grained evaluation framework with seven sub-stages across three difficulty levels, targeting the multi-agent collaboration and competition capabilities of language models.

Findings

01. API-based models excel on simple tasks

02. Open-source small models struggle with simple tasks

03. API models show some collaborative abilities but need improvement

Abstract

Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the collaboration capabilities of language models. Many benchmarks are proposed to evaluate their collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, multi-agent collaborative and competitive scenarios are ignored in existing works. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages of three varying difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition…

Taxonomy

Topics: Multi-Agent Systems and Negotiation