Evaluating Metrics for Safety with LLM-as-Judges
Kester Clegg, Richard Hawkins, Ibrahim Habli, and Tom Lawton

TL;DR
This paper explores how to evaluate the safety of Large Language Models acting as judges in critical tasks by using weighted metrics and confidence thresholds to reduce errors and ensure reliability.
Contribution
It proposes a framework for evaluating LLMs as judges using a basket of metrics and confidence thresholds to improve safety and reliability in critical applications.
Findings
Weighted metrics can lower evaluation errors.
Context sensitivity helps define error severity.
Confidence thresholds enable human review triggers.
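The three findings above can be illustrated with a minimal sketch. It is not the authors' implementation: the metric names, weights, threshold values, and the context labels `safety_critical` and `routine` are all hypothetical, chosen only to show how a weighted basket of metrics and a context-dependent confidence threshold might combine to trigger human review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricScore:
    name: str      # hypothetical metric name, not taken from the paper
    score: float   # judge agreement score, assumed to lie in [0, 1]
    weight: float  # relative importance, assumed non-negative


def weighted_confidence(scores):
    """Combine a basket of metric scores into one weighted average."""
    total_weight = sum(m.weight for m in scores)
    if total_weight <= 0:
        raise ValueError("at least one metric must have a positive weight")
    return sum(m.score * m.weight for m in scores) / total_weight


def needs_human_review(scores, threshold):
    """Flag the item for human review when confidence falls below threshold."""
    return weighted_confidence(scores) < threshold


# Assumed thresholds: a stricter bar where an error would be more severe.
THRESHOLDS = {"safety_critical": 0.9, "routine": 0.7}

# Illustrative values only.
example_scores = [
    MetricScore("exact_match", 1.0, 3.0),
    MetricScore("semantic_similarity", 0.8, 2.0),
    MetricScore("format_check", 0.5, 1.0),
]

for context, threshold in THRESHOLDS.items():
    flagged = needs_human_review(example_scores, threshold)
    print(f"{context}: review needed = {flagged}")
```

With these assumed numbers the weighted confidence is about 0.85, so the same output is escalated to a human in the stricter context but accepted in the more lenient one, which is one way context sensitivity could shape the review trigger.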
Abstract
LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. Examples include triaging post-operative care for patients based on hospital referral letters, or updating site access schedules for work crews in nuclear facilities. If we want to introduce LLMs into critical information flows that were previously handled by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification
