Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

Ishani Mondal; Shweta Bhardwaj

arXiv:2605.06856·cs.LG·May 12, 2026

TL;DR

This paper highlights the gap between benchmark performance and real-world utility in generative AI, proposing a new evaluation framework focused on human outcomes and stakeholder goals.

Contribution

The paper introduces SCU-GenEval, a four-stage utility-based evaluation framework, along with supporting tools for assessing generative AI in real-world deployment contexts.

Findings

01 Generative AI often fails to advance stakeholder goals despite benchmark success.

02 The proposed framework enables longitudinal measurement of AI's impact on human capabilities.

03 Supporting instruments facilitate practical deployment of utility-based evaluation methods.

Abstract

Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark-utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. Motivated by these observations, we argue that generative AI evaluation requires a paradigm shift from static benchmark-centered transparency toward stakeholder-, goal-, and context-conditioned utility transparency grounded in human outcome trajectories. Existing evaluations primarily characterize properties of model outputs, while deployment success depends on whether interaction with AI improves stakeholders' ability to achieve their goals over time. The missing construct is therefore…
