Black-box Uncertainty Quantification Method for LLM-as-a-Judge
Nico Wagner, Michael Desmond, Rahul Nair, Zahra Ashktorab, Elizabeth M. Daly, Qian Pan, Martín Santillán Cooper, James M. Johnson, Werner Geyer

TL;DR
This paper introduces a novel black-box uncertainty quantification method for LLMs acting as evaluators, improving the trustworthiness and consistency of their assessments across various benchmarks.
Contribution
The paper presents a new black-box approach to quantify uncertainty in LLM evaluations, addressing a gap in trustworthiness and reliability in LLM-as-a-Judge systems.
Findings
Strong correlation between evaluation accuracy and uncertainty scores
Method enhances reliability of LLM-based assessments
Effective across multiple benchmark datasets
Abstract
LLM-as-a-Judge is a widely used method for evaluating the performance of Large Language Models (LLMs) across various tasks. We address the challenge of quantifying the uncertainty of LLM-as-a-Judge evaluations. While uncertainty quantification has been well-studied in other domains, applying it effectively to LLMs poses unique challenges due to their complex decision-making capabilities and computational demands. In this paper, we introduce a novel method for quantifying uncertainty designed to enhance the trustworthiness of LLM-as-a-Judge evaluations. The method quantifies uncertainty by analyzing the relationships between generated assessments and possible ratings. By cross-evaluating these relationships and constructing a confusion matrix based on token probabilities, the method derives labels of high or low uncertainty. We evaluate our method across multiple benchmarks,…
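The abstract's pipeline (cross-evaluate assessments against possible ratings, build a confusion matrix from token probabilities, derive a high or low uncertainty label) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' stated procedure: the matrix layout (row i holds the judge's token probabilities over ratings after reading an assessment that argues for rating i), the column-mean aggregation, the 0.7 threshold, and the function name `label_uncertainty` are all hypothetical.

```python
from typing import List, Tuple


def label_uncertainty(
    confusion: List[List[float]], threshold: float = 0.7
) -> Tuple[int, str]:
    """Return (preferred rating index, "low" or "high" uncertainty).

    confusion[i][j] is assumed to be the token probability the judge
    assigns to rating j after reading an assessment arguing for rating i.
    The aggregation rule and threshold are illustrative assumptions.
    """
    n = len(confusion)
    if n == 0 or any(len(row) != n for row in confusion):
        raise ValueError("confusion must be a non-empty square matrix")

    # Average each column: how strongly rating j is preferred across
    # all assessments, including those arguing for other ratings.
    column_means = [sum(row[j] for row in confusion) / n for j in range(n)]

    # The rating with the highest average probability is the candidate.
    best = max(range(n), key=lambda j: column_means[j])

    # Low uncertainty only if that rating dominates regardless of which
    # assessment the judge was shown.
    label = "low" if column_means[best] >= threshold else "high"
    return best, label


# Judge prefers rating 0 even when shown an argument for rating 1.
confident = [[0.9, 0.1], [0.8, 0.2]]
# Judge follows whichever argument it is shown.
contested = [[0.9, 0.1], [0.2, 0.8]]

print(label_uncertainty(confident))
print(label_uncertainty(contested))
```

In this sketch, a judge whose preference survives opposing assessments gets a low-uncertainty label, while a judge that is swayed by each assessment gets a high-uncertainty label. The threshold would need to be tuned on held-out data in any real use.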
Taxonomy
Topics: Nuclear Engineering Thermal-Hydraulics
