BenchAgents: Multi-Agent Systems for Structured Benchmark Creation
Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran

TL;DR
BenchAgents is a multi-agent framework that automates the creation of high-quality evaluation benchmarks for complex generative tasks using large language models, enabling more comprehensive model assessment.
Contribution
The paper introduces a multi-agent system that automates benchmark creation while controlling data quality and diversity, and uses the resulting benchmarks to evaluate and analyze state-of-the-art models.
Findings
Created new benchmarks for planning, constraint satisfaction, and causal reasoning.
Identified common failure modes in state-of-the-art models.
Provided insights into model differences across language and vision tasks.
Abstract
Evaluation insights are limited by the availability of high-quality benchmarks. As models evolve, there is a need to create benchmarks that can measure progress on new and complex generative capabilities. However, manually creating new benchmarks is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BenchAgents, a multi-agent framework that methodically leverages large language models (LLMs) to automate evaluation benchmark creation while inherently ensuring data and (evaluation) metric quality. BenchAgents decomposes the benchmark creation process into planning, generation, verification, and evaluation, each of which is orchestrated via LLM agents. These agents interact with each other and utilize feedback from benchmark developers to improve and flexibly control data diversity and quality. We use BenchAgents to create benchmarks to evaluate…
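The four-stage decomposition named in the abstract (planning, generation, verification, evaluation) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Example`, `plan`, `generate`, `verify`, `evaluate`, `run_pipeline`, `fake_llm`) is hypothetical, the LLM is replaced by a deterministic stub, the verification and scoring checks are toy substring tests, and the developer-feedback loop is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an LLM call: prompt string in, response string out.
LLM = Callable[[str], str]


@dataclass
class Example:
    prompt: str              # generated benchmark problem
    constraints: List[str]   # constraints the problem is meant to exercise
    verified: bool = False   # set by the verification stage


def plan(llm: LLM, task: str) -> List[str]:
    """Planning stage: ask for the constraint types the benchmark should cover."""
    response = llm(f"PLAN::{task}")
    return [line.strip() for line in response.splitlines() if line.strip()]


def generate(llm: LLM, task: str, plan_items: List[str], n: int) -> List[Example]:
    """Generation stage: cycle through the plan so each constraint type is covered."""
    examples = []
    for i in range(n):
        constraint = plan_items[i % len(plan_items)]
        text = llm(f"GENERATE::{task} with constraint '{constraint}'")
        examples.append(Example(prompt=text, constraints=[constraint]))
    return examples


def verify(examples: List[Example]) -> List[Example]:
    """Verification stage (toy check): keep examples that mention their constraints."""
    kept = []
    for ex in examples:
        if ex.prompt and all(c in ex.prompt for c in ex.constraints):
            ex.verified = True
            kept.append(ex)
    return kept


def evaluate(examples: List[Example], solver: Callable[[str], str]) -> float:
    """Evaluation stage (toy metric): fraction of answers mentioning every constraint."""
    if not examples:
        return 0.0
    passed = sum(
        all(c in solver(ex.prompt) for c in ex.constraints) for ex in examples
    )
    return passed / len(examples)


def run_pipeline(llm: LLM, task: str, n: int) -> List[Example]:
    """Chain planning, generation, and verification into one benchmark build."""
    return verify(generate(llm, task, plan(llm, task), n))


def fake_llm(prompt: str) -> str:
    # Deterministic stub so the sketch runs without any model or network access.
    if prompt.startswith("PLAN"):
        return "budget\ntime window\n"
    return "Problem: " + prompt.split("::", 1)[1]


benchmark = run_pipeline(fake_llm, "calendar scheduling", 4)
score = evaluate(benchmark, solver=lambda p: p)  # echo "model" as a placeholder
```

In a real system each stage would be backed by an actual model call and a task-specific checker; keeping the stages as separate functions is what lets a developer inspect or adjust the plan before any data is generated.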
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies
