Designing Staged Evaluation Workflows for LLMs: Integrating Domain Experts, Lay Users, and Model-Generated Evaluation Criteria

Annalisa Szymanski; Simret Araya Gebreegziabher; Oghenemaro Anuyah; Ronald A. Metoyer; Toby Jia-Jun Li

arXiv:2410.02054 · cs.HC · February 17, 2026

TL;DR

This paper explores how domain experts, lay users, and LLMs develop different evaluation criteria for assessing LLM outputs, proposing a staged workflow that leverages their complementary strengths to improve evaluation effectiveness.

Contribution

The paper introduces a novel staged evaluation workflow that integrates criteria from experts, lay users, and LLMs, with guidelines for balancing quality, cost, and scalability.

Findings

01. Experts produce fact-based criteria with long-term value.

02. Lay users emphasize usability with a shorter-term focus.

03. LLMs target procedural checks for immediate task requirements.

Abstract

Large Language Models (LLMs) are increasingly utilized for domain-specific tasks, yet evaluating their outputs remains challenging. A common strategy is to apply evaluation criteria to assess alignment with domain-specific standards, yet little is understood about how criteria differ across sources or where each type is most useful in the evaluation process. This study investigates criteria developed by domain experts, lay users, and LLMs to identify their complementary roles within an evaluation workflow. Results show that experts produce fact-based criteria with long-term value, lay users emphasize usability with a shorter-term focus, and LLMs target procedural checks for immediate task requirements. We also examine how criteria evolve between a priori and a posteriori phases, noting drift across stages as well as convergence in the a posteriori phase. Based on our observations, we…

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods