The Measurement Imbalance in Agentic AI Evaluation Undermines Industry Productivity Claims
Kiana Jafari Meimandi, Gabriela Aránguiz-Dias, Grace Ra Kim, Lana Saadeddin, Allie Griffith, Mykel J. Kochenderfer

TL;DR
This paper critiques current evaluation practices for agentic AI, revealing a systemic bias towards technical metrics and advocating for a balanced, multi-dimensional assessment framework to better predict real-world success.
Contribution
It identifies a measurement imbalance in agentic AI evaluation and proposes a four-axis model to improve assessment of real-world deployment potential.
Findings
Technical metrics appear in 83% of the reviewed evaluations
Human-centered (30%), safety (53%), and economic (30%) assessments are underrepresented
Systems often fail in real-world settings despite technical success
Abstract
As industry reports claim agentic AI systems deliver double-digit productivity gains and multi-trillion-dollar economic potential, the validity of these claims has become critical for investment decisions, regulatory policy, and responsible technology adoption. However, this paper demonstrates that current evaluation practices for agentic AI systems exhibit a systemic imbalance that calls into question prevailing industry productivity claims. Our systematic review of 84 papers (2023-2025) reveals an evaluation imbalance in which technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic (30%) assessments remain peripheral, with only 15% incorporating both technical and human dimensions. This measurement gap creates a fundamental disconnect between benchmark success and deployment value. We present evidence from healthcare, finance, and retail…
Taxonomy
Topics: Impact of AI and Big Data on Business and Society
