Black-box Uncertainty Quantification Method for LLM-as-a-Judge

Nico Wagner; Michael Desmond; Rahul Nair; Zahra Ashktorab; Elizabeth M. Daly; Qian Pan; Martín Santillán Cooper; James M. Johnson; Werner Geyer

arXiv:2410.11594·cs.LG·October 16, 2024

Open Access

TL;DR

This paper introduces a novel black-box uncertainty quantification method for LLMs acting as evaluators, improving the trustworthiness and consistency of their assessments across various benchmarks.

Contribution

The paper presents a new black-box approach to quantify uncertainty in LLM evaluations, addressing a gap in trustworthiness and reliability in LLM-as-a-Judge systems.

Findings

01 Strong correlation between evaluation accuracy and uncertainty scores

02 Method enhances reliability of LLM-based assessments

03 Effective across multiple benchmark datasets

Abstract

LLM-as-a-Judge is a widely used method for evaluating the performance of Large Language Models (LLMs) across various tasks. We address the challenge of quantifying the uncertainty of LLM-as-a-Judge evaluations. While uncertainty quantification has been well-studied in other domains, applying it effectively to LLMs poses unique challenges due to their complex decision-making capabilities and computational demands. In this paper, we introduce a novel method for quantifying uncertainty designed to enhance the trustworthiness of LLM-as-a-Judge evaluations. The method quantifies uncertainty by analyzing the relationships between generated assessments and possible ratings. By cross-evaluating these relationships and constructing a confusion matrix based on token probabilities, the method derives labels of high or low uncertainty. We evaluate our method across multiple benchmarks,…
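The labeling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact decision rule, so everything specific here is an assumption made for illustration: the function name `label_uncertainty`, the column-mean aggregation, and the `threshold` value are hypothetical. The sketch assumes a square matrix in which row i holds the token probabilities the judge assigns to each possible rating after being shown an assessment generated in favor of rating i.

```python
from typing import List, Optional, Tuple


def label_uncertainty(
    confusion: List[List[float]], threshold: float = 0.7
) -> Tuple[str, Optional[int]]:
    """Label an evaluation as 'low' or 'high' uncertainty.

    confusion[i][j] is assumed to be the token probability of rating j
    when the judge is shown the assessment written in favor of rating i.
    The aggregation rule and threshold are illustrative assumptions,
    not the decision rule from the paper.
    """
    n = len(confusion)
    if n == 0 or any(len(row) != n for row in confusion):
        raise ValueError("confusion matrix must be square and non-empty")

    # Average each column: how strongly rating j is preferred across
    # all assessments, including those written to favor other ratings.
    col_means = [sum(confusion[i][j] for i in range(n)) / n for j in range(n)]

    # Pick the rating with the highest average probability.
    best = max(range(n), key=lambda j: col_means[j])

    # One rating dominating regardless of the assessment shown is
    # treated as low uncertainty; otherwise the evaluation is flagged.
    if col_means[best] >= threshold:
        return "low", best
    return "high", None


# Rating 0 wins even when the assessment argues for rating 1.
consistent = [[0.9, 0.1], [0.8, 0.2]]
# The judge follows whichever assessment it is shown.
swayed = [[0.9, 0.1], [0.2, 0.8]]

print(label_uncertainty(consistent))
print(label_uncertainty(swayed))
```

Under these assumptions, a judge whose preferred rating survives assessments arguing for the alternatives receives a low-uncertainty label, while a judge whose rating simply tracks the assessment it was shown receives a high-uncertainty label.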
