Efficient Benchmarking of AI Agents
Franck Ndzomga

TL;DR
This paper proposes a cost-effective benchmarking protocol for AI agents that preserves reliable rankings by evaluating new agents only on tasks of intermediate difficulty (historical pass rates of 30-70%), reducing the number of evaluation tasks by 44-70%.
Contribution
It introduces an optimization-free task-selection method, motivated by Item Response Theory, that preserves agent rankings under scaffold-driven distribution shift.
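The link between Item Response Theory and intermediate-difficulty tasks can be illustrated with the standard two-parameter logistic (2PL) model. This is a textbook sketch, not necessarily the formulation used in the paper: the symbols \theta (agent ability), a_i (task discrimination), and b_i (task difficulty) are assumptions introduced here for illustration.

```latex
% 2PL model: probability that an agent of ability \theta passes task i
P_i(\theta) = \frac{1}{1 + \exp\bigl(-a_i(\theta - b_i)\bigr)}

% Fisher information that task i carries about \theta
I_i(\theta) = a_i^{2} \, P_i(\theta) \, \bigl(1 - P_i(\theta)\bigr)

% P(1 - P) is maximized at P = 1/2, so for fixed a_i
\max_{\theta} I_i(\theta) = \frac{a_i^{2}}{4}
\quad \text{attained where } P_i(\theta) = \tfrac{1}{2}
```

Under this model, tasks that almost every agent passes or almost every agent fails contribute little information for separating agents, which is consistent with restricting evaluation to tasks with mid-range pass rates.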
Findings
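Under scaffold-driven distribution shift, absolute score prediction degrades while rank-order prediction remains stable.
Evaluating only on intermediate-difficulty tasks (historical pass rates of 30-70%) reduces the number of evaluation tasks by 44-70%.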
The method outperforms random sampling and greedy selection in maintaining ranking fidelity.
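Ranking fidelity of the kind described in these findings can be quantified with a rank correlation between full-benchmark scores and subset scores. The sketch below uses Kendall's tau (tau-a, ties counted as neither concordant nor discordant); the source summary does not state which metric the paper uses, so the metric, the function name `kendall_tau`, and the agent names and scores are all illustrative assumptions rather than reported results.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(full: Dict[str, float], subset: Dict[str, float]) -> float:
    """Kendall's tau-a between two score tables keyed by agent name.

    Assumes both dicts contain the same agents (hypothetical inputs).
    """
    agents = sorted(full)
    concordant = 0
    discordant = 0
    for a, b in combinations(agents, 2):
        # Positive product: both score tables order the pair the same way.
        sign = (full[a] - full[b]) * (subset[a] - subset[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 1.0


# Made-up scores: absolute values differ, but the ordering is identical.
full_scores = {"agent_x": 0.62, "agent_y": 0.48, "agent_z": 0.31}
subset_scores = {"agent_x": 0.55, "agent_y": 0.40, "agent_z": 0.20}
tau_same = kendall_tau(full_scores, subset_scores)

# Made-up scores where the subset swaps the top two agents.
swapped_scores = {"agent_x": 0.40, "agent_y": 0.55, "agent_z": 0.20}
tau_swapped = kendall_tau(full_scores, swapped_scores)
```

In the first comparison every pair is ordered identically, so tau is 1 even though the subset scores are lower across the board; in the second, one of three pairs is flipped, so tau drops to 1/3.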
Abstract
Evaluating AI agents on comprehensive benchmarks is expensive because each evaluation requires interactive rollouts with tool use and multi-step reasoning. We study whether small task subsets can preserve agent rankings at substantially lower cost. Unlike static language model benchmarks, agent evaluation is subject to scaffold-driven distribution shift, since performance depends on the framework wrapping the underlying model. Across eight benchmarks, 33 agent scaffolds, and 70+ model configurations, we find that absolute score prediction degrades under this shift, while rank-order prediction remains stable. Exploiting this asymmetry, we propose a simple optimization-free protocol: evaluate new agents only on tasks with intermediate historical pass rates (30-70%). This mid-range difficulty filter, motivated by Item Response Theory, reduces the number of evaluation tasks by 44-70% while…
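The mid-range difficulty filter described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming historical results are available as per-task lists of pass/fail outcomes; the function name `mid_difficulty_tasks`, the data layout, the inclusive treatment of the 30% and 70% boundaries, and the example tasks are assumptions made here, not details taken from the paper.

```python
from typing import Dict, List


def mid_difficulty_tasks(
    pass_history: Dict[str, List[bool]],
    low: float = 0.30,
    high: float = 0.70,
) -> List[str]:
    """Return ids of tasks whose historical pass rate lies in [low, high]."""
    selected = []
    for task_id, outcomes in pass_history.items():
        if not outcomes:
            continue  # no history, so difficulty cannot be estimated
        pass_rate = sum(outcomes) / len(outcomes)
        if low <= pass_rate <= high:
            selected.append(task_id)
    return selected


# Hypothetical history: one list of past agent outcomes per task.
history = {
    "task_a": [True, True, True, True],            # 100% pass rate: too easy
    "task_b": [True, False, True, False],          # 50% pass rate: kept
    "task_c": [False, False, False, True],         # 25% pass rate: too hard
    "task_d": [True, True, False, False, False],   # 40% pass rate: kept
}
subset = mid_difficulty_tasks(history)
```

A new agent would then be run only on the tasks in `subset`, and its rank among previously evaluated agents estimated from those results. No fitting or search is involved, which matches the "optimization-free" description.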
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
