Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
Charlotte Li, Nick Hagar, Sachita Nishal, Jeremy Gilbert, Nick Diakopoulos

TL;DR
This paper develops a human-centered, domain-specific evaluation framework for generative AI in journalism, addressing real-world applicability and stakeholder needs.
Contribution
It introduces a domain-oriented evaluation 'cookbook' and design requirements for contextually relevant, value-aligned AI assessments in journalism.
Findings
Engaged 23 journalism professionals in a workshop to inform evaluation design.
Identified domain-specific challenges in translating tasks to evaluation metrics.
Produced an evaluation structure and design guidelines for journalism AI applications.
Abstract
Benchmarks play a significant role in how technology companies communicate about model capabilities and how researchers and the public understand generative AI systems. However, existing benchmarks have been criticized for their failure to adequately capture real-world usages (i.e. ecological validity) or to measure underlying concepts (i.e. construct validity). Building on approaches in HCI, we adopt a human-centered design process to address such critiques. Working within the journalism domain we engaged 23 professionals in a workshop which informed the design of a domain-oriented evaluation ``cookbook''. Our workshop findings surface domain-specific challenges and tensions faced by designers in translating specific tasks to evaluation constructs, aligning metrics with domain-specific values, and balancing needs among different stakeholders when constructing evaluations. Through an…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
