EvalAssist: A Human-Centered Tool for LLM-as-a-Judge
Zahra Ashktorab, Werner Geyer, Michael Desmond, Elizabeth M. Daly, Martin Santillan Cooper, Qian Pan, Erik Miehling, Tejaswini Pedapati, and Hyo Jin Do

TL;DR
EvalAssist is a human-centered tool that streamlines the process of evaluating large language model outputs by providing an interactive environment for developing criteria and leveraging LLMs as evaluators.
Contribution
The paper introduces a framework for building, testing, and sharing custom evaluation criteria, and integrates LLM-based evaluation pipelines with harm-detection capabilities.
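To make the idea of a custom, shareable criterion concrete, here is a minimal Python sketch of an LLM-as-a-judge check. It is not EvalAssist's API: the `Criterion` schema, `build_prompt`, `evaluate`, and `stub_judge` are all hypothetical names chosen for illustration, and the judge is a deterministic stand-in for a real LLM call.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class Criterion:
    """A custom evaluation criterion (hypothetical schema, not EvalAssist's)."""
    name: str
    question: str
    options: List[str]

    def to_json(self) -> str:
        # Serializing to JSON is one simple way to make a criterion portable.
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "Criterion":
        return Criterion(**json.loads(text))


def build_prompt(criterion: Criterion, output: str) -> str:
    """Render the criterion and the text under evaluation into a judge prompt."""
    options = ", ".join(criterion.options)
    return (
        f"Criterion: {criterion.name}\n"
        f"Question: {criterion.question}\n"
        f"Text to evaluate: {output}\n"
        f"Answer with exactly one of: {options}"
    )


def evaluate(criterion: Criterion, output: str, judge: Callable[[str], str]) -> str:
    """Ask the judge for a verdict and reject answers outside the allowed options."""
    answer = judge(build_prompt(criterion, output)).strip()
    if answer not in criterion.options:
        raise ValueError(f"Judge returned an unexpected answer: {answer!r}")
    return answer


def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM call: says "Yes" when the evaluated text is short."""
    text = prompt.split("Text to evaluate: ", 1)[1].split("\n", 1)[0]
    return "Yes" if len(text.split()) <= 10 else "No"


if __name__ == "__main__":
    conciseness = Criterion("conciseness", "Is the response concise?", ["Yes", "No"])
    shared = Criterion.from_json(conciseness.to_json())  # round-trip, as if shared
    print(evaluate(shared, "Paris is the capital of France.", stub_judge))
```

In practice `stub_judge` would be replaced by a call to whichever model serves as the evaluator; keeping the judge as a plain callable means the same criterion can be tested against different models.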
Findings
Deployed internally, with hundreds of users.
Supports customizable, portable evaluation criteria.
Includes specialized evaluators for harm detection.
Abstract
With the broad availability of large language models and their ability to generate vast outputs using varied prompts and configurations, determining the best output for a given task requires an intensive evaluation process, one where machine learning practitioners must decide how to assess the outputs and then carefully carry out the evaluation. This process is both time-consuming and costly. As practitioners work with an increasing number of models, they must now evaluate outputs to determine which model and prompt performs best for a given task. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, assess harms and risks, or assist human evaluators with detailed assessments. We present EvalAssist, a framework that simplifies the LLM-as-a-judge workflow. The system provides an online criteria development environment, where users can interactively…
