Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

Yee-Yin Choong; Kristen Greene; Alice Qian; Meryem Marasli; Ziqi Yang; Sophia Chen; Laura Dabbish; Anand Rao; Hong Shen

arXiv:2605.07986 · cs.HC · May 11, 2026

TL;DR

This paper proposes a structured, human-centered process for transforming high-level AI use cases into detailed, operational evaluation scenarios to enable more consistent and meaningful AI comparisons.

Contribution

The paper introduces a repeatable process that uses a structured AI Use Case Worksheet and human review to turn high-level use cases into detailed evaluation scenarios.

Findings

01. Demonstrated utility in the U.S. financial services sector

02. Generated 107 scenarios from SME-elicited use cases

03. Validated scenario quality with a specialized rubric

Abstract

AI measurement science has a wide variety of methodologies and measurements for comparing AI systems, resulting in what often appear to be "apples-to-oranges" comparisons across AI evaluations. To move toward "apples-to-apples" comparisons in real-world AI evaluations, this work advocates for methodological transparency in evaluation scenarios, operational grounding, and human-centered design (HCD) principles. We propose a repeatable process for transforming high-level use cases to detailed scenarios by eliciting use cases from subject matter experts (SMEs) via a structured AI Use Case Worksheet with six key elements: use case, sector, user (direct and indirect), intended outcomes, expected impacts (positive and negative), and KPIs and metrics. We demonstrate utility of the worksheet and process in the U.S. financial services sector. This paper reports on example high-level AI use cases…
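
To make the six worksheet elements listed in the abstract concrete, here is a minimal Python sketch of how one worksheet entry might be held as a record. The class name `AIUseCaseWorksheet`, its field names, and the `missing_elements` helper are assumptions made for illustration, not the authors' schema or tooling, and the example values are placeholders rather than use cases from the paper.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class AIUseCaseWorksheet:
    """Hypothetical record for the six worksheet elements named in the abstract."""

    use_case: str                                                # 1. use case
    sector: str                                                  # 2. sector
    direct_users: List[str] = field(default_factory=list)        # 3. user (direct)
    indirect_users: List[str] = field(default_factory=list)      # 3. user (indirect)
    intended_outcomes: List[str] = field(default_factory=list)   # 4. intended outcomes
    positive_impacts: List[str] = field(default_factory=list)    # 5. expected impacts (positive)
    negative_impacts: List[str] = field(default_factory=list)    # 5. expected impacts (negative)
    kpis_and_metrics: List[str] = field(default_factory=list)    # 6. KPIs and metrics

    def missing_elements(self) -> List[str]:
        """Return the names of fields that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values only; these are not taken from the paper.
example = AIUseCaseWorksheet(
    use_case="placeholder use case",
    sector="U.S. financial services",
    direct_users=["placeholder direct user"],
)
print(example.missing_elements())
```

A completeness check like `missing_elements` is one possible way to flag a worksheet that is not yet ready to be expanded into scenarios; the paper's own process relies on human review, which this sketch does not model.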
