Graph-based Confidence Calibration for Large Language Models

Yukun Li; Sijia Wang; Lifu Huang; Li-Ping Liu

arXiv:2411.02454·cs.CL·May 23, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a graph neural network-based method to improve confidence calibration in large language models by analyzing response consistency, leading to more trustworthy AI outputs in diverse scenarios.

Contribution

It presents a novel approach using a consistency graph and GNN to assess response correctness, enhancing confidence calibration for LLMs.

Findings

1. Strong calibration performance on benchmark datasets
2. Effective generalization to out-of-domain cases
3. Improves trustworthiness of LLM responses
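
To make "calibration performance" concrete, here is a minimal sketch of expected calibration error (ECE), one commonly used calibration metric. This page does not say which metric the experiments report, so the choice of ECE, the function name `expected_calibration_error`, and the default of 10 equal-width bins are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    Illustrative only: the metric and binning scheme are assumptions,
    not taken from the paper.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need two non-empty sequences of equal length")

    # Group (confidence, label) pairs into equal-width confidence bins.
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A lower value means the stated confidences track the observed accuracy more closely; a value of 0 means they match exactly within every bin.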

Abstract

Reliable confidence estimation is essential for enhancing the trustworthiness of large language models (LLMs), especially in high-stakes scenarios. Despite its importance, accurately estimating confidence in LLM responses remains a significant challenge. In this work, we propose using an auxiliary learning model to assess response correctness based on the self-consistency of multiple outputs generated by the LLM. Our method builds a consistency graph to represent the agreement among multiple responses and uses a graph neural network (GNN) to estimate the likelihood that each response is correct. Experiments demonstrate that this method has strong calibration performance on various benchmark datasets and generalizes well to out-of-domain cases.
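
The pipeline the abstract describes (sample several responses, connect them in a consistency graph, then score each node) can be sketched roughly as follows. This is not the authors' implementation: token-level Jaccard overlap stands in for whatever similarity measure the paper uses, and a single unweighted mean-aggregation step stands in for the trained GNN. All function names here are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap; a crude stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consistency_graph(responses: List[str]) -> List[List[float]]:
    """Weighted adjacency matrix: entry [i][j] is the similarity of responses i and j."""
    n = len(responses)
    return [
        [0.0 if i == j else jaccard(responses[i], responses[j]) for j in range(n)]
        for i in range(n)
    ]


def propagate(adj: List[List[float]], feats: List[float]) -> List[float]:
    """One mean-aggregation message-passing step with self-loops.

    Assumption: no learned weights. A real GNN would apply trained
    parameters and a nonlinearity at this point.
    """
    out = []
    for i, row in enumerate(adj):
        total = 1.0 + sum(row)  # self-loop weight of 1 plus neighbour weights
        mixed = feats[i] + sum(w * f for w, f in zip(row, feats))
        out.append(mixed / total)
    return out


def agreement_scores(responses: List[str]) -> List[float]:
    """Untrained per-response agreement score in [0, 1]."""
    adj = consistency_graph(responses)
    n = len(responses)
    # Initial node feature: mean similarity to the other sampled responses.
    feats = [sum(row) / max(n - 1, 1) for row in adj]
    return propagate(adj, feats)


scores = agreement_scores(["Paris", "paris", "Paris", "Lyon"])
print([round(s, 3) for s in scores])
```

In this toy run the three matching answers receive a higher score than the lone outlier. In the method described above, the propagation step would instead be a GNN trained on correctness labels, so the output would be a learned estimate of the probability that each response is correct rather than a raw agreement score.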

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper presents a novel method for confidence calibration in large language models by leveraging graph neural networks and the consistency among multiple model responses.
2. The manuscript is well organized and the experiments are comprehensive.

Weaknesses

1. The premise of this paper is that if LLMs give similar responses, there is less uncertainty, and these responses tend to have a high probability of being correct. It is common to use the idea of self-consistency to guide the selection of the most appropriate answer from multiple answers, but it may not be reasonable to use it as a probability to evaluate the correctness of an answer. For example, the LLM may generate the same wrong answer when sampling multiple answers. In t

Reviewer 02 · Rating 5 · Confidence 5

Strengths

* The paper considers an important problem (LLM confidence estimation).
* A diverse set of baseline methods is included (although I would have liked more information about how they were tuned compared to the baseline).
* The approach is applicable to “black box” models since it only relies on being able to sample from the model, rather than requiring access to the activations or full predictive distribution.
* The idea of using a GNN to calibrate a set of responses given a measure of semantic s

Weaknesses

* The use of a fixed number of samples and K for K-means seems like it could introduce some issues. If there’s no/little ambiguity won’t this still return three different clusters? At least some more sensitivity analysis should be done for K but this seems like a fundamental issue with the approach since choosing the number of clusters is underspecified in general.
* The quality of the approach seems like it depends strongly on the semantic similarity used. SBERT is quite an old approach and it’

Reviewer 03 · Rating 3 · Confidence 5

Strengths

1. The topic of the paper is important. Well-calibrated confidence in LLM outputs can benefit many domains.
2. The model exhibits strong generalizability across different datasets and LLMs, showing robustness against domain shifts, which is valuable for real-world applications.
3. The paper includes sensitivity analyses and comparisons across various configurations, demonstrating the stability and effectiveness of the method under different setups and highlighting the performance benefits

Weaknesses

1. The requirement for sampling multiple responses and constructing similarity graphs for each query can introduce substantial computational costs, limiting scalability for real-time or resource-constrained applications.
2. The motivation is not clear. First, how to choose the value of \tau in Eqn. is an extra challenge. Second, once the correctness labels are available, why do we need a GNN for classification? I think it is a typical text classification task. The motivation of choosing GNN

Reviewer 04 · Rating 5 · Confidence 4

Strengths

1. The paper is easy to follow.
2. This work develops a graph-based method to determine semantic uncertainty, which is interesting.
3. Better performance in OOD settings.

Weaknesses

* **Main concern**: the scientific contribution of this work is limited. The proposed method is an extension of semantic uncertainty with GNNs.
* Writing and presentation should be improved. Some claims appear arbitrary, without any proof. Figure 1 is not very informative. (See questions below.)
* The proposed method requires labels, but other baselines such as SelfCheckGPT and semantic uncertainty require no further training.
* the use of ROUGE to evaluate semantic equivalence i

Taxonomy

Topics: Topic Modeling

Methods: Graph Neural Network · Sparse Evolutionary Training