BenchAgents: Multi-Agent Systems for Structured Benchmark Creation

Natasha Butt; Varun Chandrasekaran; Neel Joshi; Besmira Nushi; Vidhisha Balachandran

arXiv:2410.22584 · cs.LG · October 8, 2025

TL;DR

BenchAgents is a multi-agent framework that automates the creation of high-quality evaluation benchmarks for complex generative tasks using large language models, enabling more comprehensive model assessment.

Contribution

The paper introduces a multi-agent system that automates benchmark creation while controlling data quality and diversity, and uses the resulting benchmarks to evaluate and analyze state-of-the-art models.

Findings

01. Created new benchmarks for planning, constraint satisfaction, and causal reasoning.

02. Identified common failure modes in state-of-the-art models.

03. Provided insights into model differences across language and vision tasks.

Abstract

Evaluation insights are limited by the availability of high-quality benchmarks. As models evolve, there is a need to create benchmarks that can measure progress on new and complex generative capabilities. However, manually creating new benchmarks is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BenchAgents, a multi-agent framework that methodically leverages large language models (LLMs) to automate evaluation benchmark creation while inherently ensuring data and (evaluation) metric quality. BenchAgents decomposes the benchmark creation process into planning, generation, verification, and evaluation, each of which is orchestrated via LLM agents. These agents interact with each other and utilize feedback from benchmark developers to improve and flexibly control data diversity and quality. We use BenchAgents to create benchmarks to evaluate…
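The four-stage decomposition named in the abstract (planning, generation, verification, evaluation) can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Agent` type, the `Benchmark` class, `build_benchmark`, and every stub agent below are hypothetical names, and each agent is a plain Python callable standing in for an LLM call. The sketch is also strictly linear, whereas the abstract describes agents that interact with each other.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call: prompt text in, response text out.
Agent = Callable[[str], str]


@dataclass
class Benchmark:
    plan: List[str] = field(default_factory=list)       # aspects the data should cover
    instances: List[str] = field(default_factory=list)  # candidates that passed verification
    metrics: List[str] = field(default_factory=list)    # evaluation criteria


def build_benchmark(task: str, planner: Agent, generator: Agent,
                    verifier: Agent, evaluator: Agent,
                    feedback: str = "") -> Benchmark:
    """Run the four stages once, in order (assumed control flow)."""
    bench = Benchmark()

    # Planning: ask for one plan item per line, optionally steered by developer feedback.
    plan_text = planner(f"Task: {task}\nDeveloper feedback: {feedback}")
    bench.plan = [line.strip() for line in plan_text.splitlines() if line.strip()]

    # Generation: one candidate instance per plan item.
    candidates = [generator(f"Task: {task}\nPlan item: {item}") for item in bench.plan]

    # Verification: keep only candidates the verifier marks as "pass".
    bench.instances = [c for c in candidates if verifier(c).strip().lower() == "pass"]

    # Evaluation: one metric description per plan item.
    bench.metrics = [evaluator(f"Metric for: {item}") for item in bench.plan]
    return bench


# Deterministic stub agents so the sketch runs without any LLM (all hypothetical).
def stub_planner(prompt: str) -> str:
    return "budget limit\ntime window"


def stub_generator(prompt: str) -> str:
    return "Write a task that respects: " + prompt.split("Plan item: ")[1]


def stub_verifier(candidate: str) -> str:
    return "pass" if "budget" in candidate else "fail"


def stub_evaluator(prompt: str) -> str:
    return "check " + prompt.split("Metric for: ")[1]


bench = build_benchmark("planning", stub_planner, stub_generator,
                        stub_verifier, stub_evaluator)
print(bench.plan)
print(bench.instances)
print(bench.metrics)
```

Passing agents as callables keeps each stage swappable, so a real model client could replace any stub without changing the control flow. Developer feedback enters here only through the planner prompt, which is one possible reading of the abstract rather than a documented design.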

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies