Human-Centered Design Recommendations for LLM-as-a-Judge

Qian Pan; Zahra Ashktorab; Michael Desmond; Martin Santillan Cooper; James Johnson; Rahul Nair; Elizabeth Daly; Werner Geyer

arXiv:2407.03479 · cs.HC · July 8, 2024 · 1 citation

TL;DR

This paper explores how to effectively integrate human input into LLM-based evaluation systems to improve trust, reliability, and alignment with human judgment, through a user-centered design approach.

Contribution

It presents a user study of EvaluLLM, a design exploration that lets users employ LLMs as customizable judges, and derives practical design recommendations from interviews with eight domain experts.

Findings

01  Human input is essential for aligning LLM evaluations with user expectations.

02  Design recommendations improve the effectiveness of human-assisted LLM evaluation systems.

03  Expert interviews highlight key challenges and solutions in developing evaluation criteria.

Abstract

Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain significant concerns. Integrating human input is crucial to ensure that the evaluation criteria are aligned with the human's intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM that enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for…

Taxonomy

Topics: Legal Education and Practice Innovations · Artificial Intelligence in Law · Law, AI, and Intellectual Property