The Measurement Imbalance in Agentic AI Evaluation Undermines Industry Productivity Claims

Kiana Jafari Meimandi; Gabriela Aránguiz-Dias; Grace Ra Kim; Lana Saadeddin; Allie Griffith; Mykel J. Kochenderfer

arXiv:2506.02064 · cs.CY · October 3, 2025

TL;DR

This paper critiques current evaluation practices for agentic AI, revealing a systemic bias towards technical metrics and advocating for a balanced, multi-dimensional assessment framework to better predict real-world success.

Contribution

It identifies a measurement imbalance in agentic AI evaluation and proposes a four-axis model to improve assessment of real-world deployment potential.

Findings

01

Technical metrics appear in 83% of the reviewed evaluations

02

Human-centered (30%) and safety (53%) assessments are underrepresented

03

Systems often fail in real-world settings despite technical success

Abstract

As industry reports claim agentic AI systems deliver double-digit productivity gains and multi-trillion-dollar economic potential, the validity of these claims has become critical for investment decisions, regulatory policy, and responsible technology adoption. However, this paper demonstrates that current evaluation practices for agentic AI systems exhibit a systemic imbalance that calls into question prevailing industry productivity claims. Our systematic review of 84 papers (2023–2025) reveals an evaluation imbalance where technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic assessments (30%) remain peripheral, with only 15% incorporating both technical and human dimensions. This measurement gap creates a fundamental disconnect between benchmark success and deployment value. We present evidence from healthcare, finance, and retail…

Taxonomy

Topics: Impact of AI and Big Data on Business and Society