AI Agents That Matter
Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan

TL;DR
This paper critically examines current AI agent benchmarks, identifies key shortcomings such as overemphasis on accuracy and lack of standardization, and proposes a framework for more practical, cost-effective, and reproducible evaluation methods.
Contribution
The paper introduces a comprehensive analysis of AI agent benchmarks, highlights their limitations, and proposes new evaluation practices including joint optimization of accuracy and cost, and methods to prevent overfitting.
Findings
Jointly optimizing accuracy and cost can greatly reduce cost while maintaining accuracy.
Current benchmarks often lack proper holdout sets, leading to overfitting.
Standardized evaluation practices improve reproducibility and real-world applicability.
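One way to make the accuracy-and-cost finding concrete is to compare agents on both metrics at once and keep only those that no other agent beats on both. The sketch below is an illustration under stated assumptions, not the paper's implementation: the `AgentResult` type, the `pareto_frontier` helper, and all agent names and numbers are hypothetical.

```python
from typing import List, NamedTuple


class AgentResult(NamedTuple):
    name: str
    accuracy: float  # fraction of benchmark tasks solved; higher is better
    cost: float      # cost per evaluation run (e.g. dollars); lower is better


def pareto_frontier(results: List[AgentResult]) -> List[AgentResult]:
    """Return agents that are not dominated on (accuracy, cost).

    Agent b dominates agent a when b is at least as accurate and at least
    as cheap as a, and strictly better on at least one of the two metrics.
    """
    frontier = []
    for a in results:
        dominated = any(
            b.accuracy >= a.accuracy
            and b.cost <= a.cost
            and (b.accuracy > a.accuracy or b.cost < a.cost)
            for b in results
        )
        if not dominated:
            frontier.append(a)
    # Sort by cost so the accuracy/cost trade-off reads left to right.
    return sorted(frontier, key=lambda r: r.cost)


# Hypothetical numbers, made up purely for illustration.
results = [
    AgentResult("simple_baseline", accuracy=0.80, cost=1.0),
    AgentResult("retry_baseline", accuracy=0.85, cost=2.5),
    AgentResult("complex_agent", accuracy=0.84, cost=30.0),
    AgentResult("mid_agent", accuracy=0.90, cost=12.0),
]

frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)
```

In this made-up data, `complex_agent` drops out because `retry_baseline` is both more accurate and cheaper. Reporting accuracy alone would hide that kind of comparison.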
Abstract
AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that hinder their usefulness in real-world applications. First, there is a narrow focus on accuracy without attention to other metrics. As a result, SOTA agents are needlessly complex and costly, and the community has reached mistaken conclusions about the sources of accuracy gains. Our focus on cost in addition to accuracy motivates the new goal of jointly optimizing the two metrics. We design and implement one such optimization, showing its potential to greatly reduce cost while maintaining accuracy. Second, the benchmarking needs of model and downstream developers have been conflated, making it hard to identify which agent would be best suited for a particular application. Third, many agent…
Videos
The threat of existential risk from AI (YouTube)
Taxonomy
Topics: Computability, Logic, AI Algorithms · Multi-Agent Systems and Negotiation
Methods: Softmax · Attention Is All You Need · Focus
