An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang

TL;DR
This paper investigates the uncertainty in large language model evaluators, demonstrating how different strategies and fine-tuning can improve evaluation stability and OOD detection, with extensive experiments and a new uncertainty-aware model.
Contribution
It provides the first comprehensive analysis of uncertainty in LLM evaluators and introduces ConfiLM, a fine-tuned model that leverages uncertainty to enhance evaluation reliability.
Findings
Evaluation uncertainty varies with model size and family.
Prompting strategies can reduce evaluation uncertainty.
Uncertainty-aware fine-tuning improves OOD evaluation performance.
Abstract
As LLM-as-a-Judge emerges as a new paradigm for assessing large language models (LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM evaluators. While substantial work has focused on alignment and bias, little research has concentrated on the stability of LLM evaluators. In this paper, we conduct extensive experiments involving 9 widely used LLM evaluators across 2 different evaluation settings to investigate the uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. With careful comparative analyses, we find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent. By utilizing uncertainty to enhance LLM's reliability and detection capability in Out-Of-Distribution (OOD) data, we further…
Peer Reviews
Decision: ICLR 2025 Poster
- The first paper that empirically examines the role of uncertainty within the context of LLMJ, offering a series of empirical findings that motivate the creation of an uncertainty-aware LLM evaluator.
- Despite the large number of experiments, it is quite clear what the focus is in each of the experiments, and it's good to see some intuitions properly empirically verified.
- The choice of underlying model is quite good, which helps with generalising some of the empirical findings.
- The paper adopts a simplified view on recent 'LLM-as-a-Judge' (LLMJ) literature, skipping some related work that aimed to improve confidence of LLMJ models via calibration (https://openreview.net/pdf?id=L3FHMoKZcS) or better prompt optimization (PO; https://arxiv.org/pdf/2406.11370). Improving LLMJ methods requires a multi-component/multi-aspectual approach, which requires combinations of prompt optimisation, calibration and uncertainty mitigation strategies, and the paper would be much strong
The paper addresses a timely and increasingly relevant issue: the stability of LLMs-as-Judges. While the use of log probabilities as a measure of confidence is not novel in itself, the original contribution lies in its application within the context of evaluators. It investigates how to improve the evaluators' confidence and even recognize incorrect responses. The work is significant for its practical approach to improving LLM evaluator reliability, especially in OOD scenarios. The fi
- Previous studies, such as those by Lyu et al. [1], have shown that log probabilities do not always correlate with human preferences. This raises concerns about the reliability of using these metrics in LLM-as-Judge systems, which have become popular for their scalability and cost-effectiveness in automating evaluations that traditionally relied on human feedback. The paper could further explore how improving model confidence affects agreement with human evaluators, either via CoT or fine-tunin
- Since the paradigm of LLM-as-judge has become widely adopted for evaluation purposes, it is important to conduct an in-depth analysis of the stability of this evaluative framework.
- This paper conducts extensive empirical analyses across multiple models and datasets, offering results that can help researchers deepen their understanding of various models as evaluators.
- In my opinion, the main issue with this paper is my uncertainty regarding **whether its research methodology adequately supports the conclusions it claims to draw**.
- As demonstrated in the introduction, the core research question of this paper is "Can large language models provide consistent evaluation quality across different inputs and domains?" However, I am not convinced that the token probabilities predicted by the models sufficiently reflect the "evaluation quality" the authors aim
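The reviews above refer to token log probabilities as a measure of an evaluator's confidence. As a rough illustration only, the sketch below turns the log probabilities of candidate verdict tokens into a normalized confidence and an entropy score. The function name `verdict_confidence`, the labels "A", "B", and "Tie", and the example numbers are all hypothetical; the excerpts here do not specify the paper's exact formulation, so this should not be read as the authors' method.

```python
import math

def verdict_confidence(logprobs):
    """Summarize a judge's verdict from candidate-token log probabilities.

    `logprobs` maps each candidate verdict label (hypothetical labels,
    e.g. "A", "B", "Tie") to the log probability the evaluator assigned
    to that token. Returns (verdict, confidence, entropy).
    """
    # Softmax over the candidates, shifted by the max for numerical stability.
    peak = max(logprobs.values())
    weights = {label: math.exp(lp - peak) for label, lp in logprobs.items()}
    total = sum(weights.values())
    probs = {label: w / total for label, w in weights.items()}

    # The verdict is the most probable label; its probability is the confidence.
    verdict = max(probs, key=probs.get)

    # Shannon entropy (in nats): higher means a less certain evaluation.
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return verdict, probs[verdict], entropy

# Made-up log probabilities for illustration; not taken from the paper.
example = {"A": math.log(0.7), "B": math.log(0.2), "Tie": math.log(0.1)}
verdict, confidence, entropy = verdict_confidence(example)
print(verdict, round(confidence, 3), round(entropy, 3))
```

Renormalizing over the candidate labels is a design choice of this sketch: it discards probability mass placed on unrelated tokens, which makes scores comparable across prompts but can overstate confidence when the evaluator was unlikely to emit any of the labels.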
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods
Methods: Sparse Evolutionary Training
