Efficient Benchmarking of AI Agents

Franck Ndzomga

arXiv:2603.23749 · cs.AI · March 26, 2026

Open Access

TL;DR

This paper proposes a low-cost benchmarking protocol for AI agents: evaluating new agents only on tasks of intermediate difficulty (historical pass rates of 30-70%) preserves reliable rankings while cutting the number of evaluation tasks by 44-70%.

Contribution

It introduces a simple, optimization-free task-selection method, motivated by Item Response Theory, that preserves agent rankings under scaffold-driven distribution shift.
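
As a hedged illustration of why Item Response Theory points toward mid-difficulty tasks, consider the standard two-parameter logistic (2PL) model. This summary does not say which IRT model the paper uses, so the 2PL form and the symbols below (agent ability \theta, task discrimination a_i, task difficulty b_i) are assumptions made for illustration only.

```latex
% Probability that an agent of ability \theta passes task i (2PL model, assumed here)
P_i(\theta) = \frac{1}{1 + \exp\bigl(-a_i(\theta - b_i)\bigr)}

% Fisher information that task i contributes about \theta
I_i(\theta) = a_i^{2} \, P_i(\theta) \bigl(1 - P_i(\theta)\bigr)

% The product P(1 - P) is maximized at P = 1/2, so
\max_{\theta} I_i(\theta) = \frac{a_i^{2}}{4}
\quad \text{attained when } P_i(\theta) = \tfrac{1}{2}
```

Under this model, a task that almost every agent passes or almost every agent fails has P(1 - P) close to zero and says little about how agents compare, which is consistent with filtering on intermediate pass rates.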

Findings

01

Rank-order prediction remains stable under scaffold-driven distribution shift, even though absolute score prediction degrades.

02

Evaluating only on intermediate-difficulty tasks (historical pass rates of 30-70%) reduces the number of evaluation tasks by 44-70%.

03

The method outperforms random sampling and greedy selection in maintaining ranking fidelity.

Abstract

Evaluating AI agents on comprehensive benchmarks is expensive because each evaluation requires interactive rollouts with tool use and multi-step reasoning. We study whether small task subsets can preserve agent rankings at substantially lower cost. Unlike static language model benchmarks, agent evaluation is subject to scaffold-driven distribution shift, since performance depends on the framework wrapping the underlying model. Across eight benchmarks, 33 agent scaffolds, and 70+ model configurations, we find that absolute score prediction degrades under this shift, while rank-order prediction remains stable. Exploiting this asymmetry, we propose a simple optimization-free protocol: evaluate new agents only on tasks with intermediate historical pass rates (30-70%). This mid-range difficulty filter, motivated by Item Response Theory, reduces the number of evaluation tasks by 44-70% while…
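
The mid-range difficulty filter described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the data layout, the function names (`task_pass_rates`, `select_mid_range`, `subset_score`), the toy history, and the choice of inclusive 30% and 70% bounds are all hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: results[agent][task] is True if that agent passed the task.
Results = Dict[str, Dict[str, bool]]


def task_pass_rates(results: Results) -> Dict[str, float]:
    """Fraction of previously evaluated agents that passed each task."""
    totals: Dict[str, int] = {}
    passes: Dict[str, int] = {}
    for outcomes in results.values():
        for task, passed in outcomes.items():
            totals[task] = totals.get(task, 0) + 1
            passes[task] = passes.get(task, 0) + int(passed)
    return {task: passes[task] / totals[task] for task in totals}


def select_mid_range(pass_rates: Dict[str, float],
                     low: float = 0.30,
                     high: float = 0.70) -> List[str]:
    """Keep tasks whose historical pass rate lies in [low, high].

    Inclusive bounds are an assumption; the abstract only says "30-70%".
    """
    return sorted(t for t, rate in pass_rates.items() if low <= rate <= high)


def subset_score(outcomes: Dict[str, bool], tasks: List[str]) -> float:
    """Pass rate of one agent, restricted to the selected tasks."""
    if not tasks:
        raise ValueError("no tasks selected")
    return sum(outcomes[t] for t in tasks) / len(tasks)


# Toy historical results (fabricated purely for illustration).
history: Results = {
    "agent_a": {"t1": True, "t2": True, "t3": False, "t4": False},
    "agent_b": {"t1": True, "t2": False, "t3": True, "t4": False},
    "agent_c": {"t1": True, "t2": True, "t3": False, "t4": False},
    "agent_d": {"t1": True, "t2": False, "t3": True, "t4": False},
}

rates = task_pass_rates(history)
selected = select_mid_range(rates)

# A new agent only needs rollouts on the selected tasks.
new_agent_outcomes = {"t2": True, "t3": False}
new_agent_score = subset_score(new_agent_outcomes, selected)
```

In this toy history, `t1` is passed by every agent and `t4` by none, so both are dropped, and a new agent would be run only on `t2` and `t3`. The resulting subset score is meant for ranking agents against each other; per the abstract, absolute score prediction is the part that degrades under scaffold shift.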

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications