Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Ishani Mondal, Shweta Bhardwaj

TL;DR
This paper highlights the gap between benchmark performance and real-world utility in generative AI, proposing a new evaluation framework focused on human outcomes and stakeholder goals.
Contribution
It introduces SCU-GenEval, a four-stage utility-based evaluation framework and supporting tools to better assess generative AI in real-world deployment contexts.
Findings
Generative AI often fails to advance stakeholders' goals despite benchmark success.
The proposed framework enables longitudinal measurement of AI's impact on human capabilities.
Supporting instruments facilitate practical deployment of utility-based evaluation methods.
Abstract
Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark-utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. Motivated by these observations, we contend that generative AI evaluation requires a paradigm shift from static, benchmark-centered transparency toward stakeholder-, goal-, and context-conditioned utility transparency grounded in human outcome trajectories. Existing evaluations primarily characterize properties of model outputs, while deployment success depends on whether interaction with AI improves stakeholders' ability to achieve their goals over time. The missing construct is therefore…
