Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences

Zahra Ashktorab; Michael Desmond; Qian Pan; James M. Johnson; Martin Santillan Cooper; Elizabeth M. Daly; Rahul Nair; Tejaswini Pedapati; Hyo Jin Do; Werner Geyer

arXiv:2410.00873 · cs.HC · August 7, 2025 · 2 cites

TL;DR

This paper investigates how machine learning practitioners use LLMs for task-specific evaluations, revealing factors that influence assessment strategies and providing design recommendations for better evaluation tools.

Contribution

It offers new insights into user interactions with LLM-based evaluators and proposes system improvements for more effective human-AI evaluation workflows.

Findings

1. Users perform more evaluations with direct assessment strategies.

2. Task-specific criteria influence evaluation modifications.

3. Changing evaluator models affects judgment quality.

Abstract

Evaluation of large language model (LLM) outputs requires users to make critical judgments about the best outputs across various configurations. This process is costly and time-consuming given the large volume of data. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, or assist human evaluators with detailed assessments. Effective front-end tools are critical for supporting this process. Two common approaches for using LLMs as evaluators are direct assessment and pairwise comparison. In our study with machine learning practitioners (n=15), each completing 6 tasks yielding 131 evaluations, we explore how task-related factors and assessment strategies influence criteria refinement and user perceptions. Findings show that users performed more evaluations with direct assessment by making criteria task-specific, modifying judgments, and…
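The two strategies named in the abstract can be illustrated with a minimal sketch. This is not EvalAssist's implementation: the prompt wording, the function names (`direct_assessment`, `pairwise_comparison`), and the `toy_judge` stand-in for an LLM call are all assumptions made for illustration. Direct assessment scores one output at a time against a criterion with fixed answer options; pairwise comparison asks the judge to pick the better of two outputs and tallies wins across all pairs.

```python
import itertools
from typing import Callable, Dict, List

# Hypothetical judge interface: takes a prompt, returns the judge's short answer.
# In practice this would wrap an LLM call; here it is left abstract.
Judge = Callable[[str], str]


def direct_assessment(judge: Judge, criterion: str,
                      options: List[str], output: str) -> str:
    """Rate a single output against a criterion, choosing one of `options`."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Options: {', '.join(options)}\n"
        f"Output: {output}\n"
        "Answer with exactly one option."
    )
    answer = judge(prompt).strip()
    if answer not in options:
        raise ValueError(f"Judge returned an unexpected option: {answer!r}")
    return answer


def pairwise_comparison(judge: Judge, criterion: str,
                        outputs: List[str]) -> Dict[int, int]:
    """Compare every pair of outputs and count wins per output index."""
    wins = {i: 0 for i in range(len(outputs))}
    for i, j in itertools.combinations(range(len(outputs)), 2):
        prompt = (
            f"Criterion: {criterion}\n"
            f"Output A: {outputs[i]}\n"
            f"Output B: {outputs[j]}\n"
            "Answer with A or B."
        )
        answer = judge(prompt).strip()
        if answer == "A":
            wins[i] += 1
        elif answer == "B":
            wins[j] += 1
        else:
            raise ValueError(f"Judge returned an unexpected answer: {answer!r}")
    return wins


def toy_judge(prompt: str) -> str:
    """Deterministic stand-in for an LLM judge (an assumption, for testing only).

    It treats shorter text as more concise: at most 5 words earns "Yes" in
    direct assessment, and the shorter string wins a pairwise comparison.
    """
    fields = dict(line.split(": ", 1)
                  for line in prompt.splitlines() if ": " in line)
    if "Output A" in fields:
        return "A" if len(fields["Output A"]) <= len(fields["Output B"]) else "B"
    return "Yes" if len(fields["Output"].split()) <= 5 else "No"


if __name__ == "__main__":
    candidates = [
        "Paris.",
        "The capital of France is the city of Paris, of course.",
        "It is Paris.",
    ]
    criterion = "Is the response concise?"
    for text in candidates:
        print(direct_assessment(toy_judge, criterion, ["Yes", "No"], text))
    print(pairwise_comparison(toy_judge, criterion, candidates))
```

Direct assessment needs one judge call per output, while the all-pairs comparison above needs one call per pair, so its cost grows quadratically with the number of outputs. That difference in effort is one plausible reason the two strategies might be used differently, though the sketch makes no claim about the study's own explanation.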

Taxonomy

Topics: Impact of AI and Big Data on Business and Society