QA-Calibration of Language Model Confidence Scores

Putra Manggala; Atalanti Mastakouri; Elke Kirschbaum; Shiva Prasad Kasiviswanathan; Aaditya Ramdas

arXiv:2410.06615·cs.CL·March 4, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces QA-calibration, a new approach to ensure that confidence scores from generative QA systems are reliably calibrated across different question groups, improving decision-making in critical applications.

Contribution

The paper proposes QA-calibration, a generalized calibration notion for QA systems, along with discretized posthoc calibration methods and theoretical guarantees, validated on multiple benchmarks and models.
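
To make "discretized posthoc calibration" concrete, here is a minimal sketch of group-wise histogram binning: on a held-out calibration set, each (group, score-bin) cell is assigned its empirical accuracy, and that value replaces the raw confidence at test time. This is an illustration under stated assumptions, not the authors' exact algorithm. The function names, the uniform-width bins, the group labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def bin_index(score, n_bins):
    """Uniform-width bin index in [0, n_bins - 1] for a score in [0, 1]."""
    return min(int(score * n_bins), n_bins - 1)


def fit_groupwise_binning(scores, correct, groups, n_bins=2):
    """Map each (group, bin) cell to its empirical accuracy on the calibration set."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, y, g in zip(scores, correct, groups):
        cell = (g, bin_index(s, n_bins))
        sums[cell] += y
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in counts}


def recalibrate(score, group, table, n_bins=2):
    """Return the cell's accuracy; fall back to the raw score for unseen cells."""
    return table.get((group, bin_index(score, n_bins)), score)


# Hypothetical calibration set: raw confidences, correctness labels, group ids.
cal_scores = [0.9, 0.8, 0.9, 0.8, 0.2, 0.3]
cal_correct = [1, 1, 0, 0, 0, 1]
cal_groups = ["easy", "easy", "hard", "hard", "hard", "hard"]

table = fit_groupwise_binning(cal_scores, cal_correct, cal_groups)
easy_conf = recalibrate(0.85, "easy", table)  # high-score "easy" cell
hard_conf = recalibrate(0.85, "hard", table)  # high-score "hard" cell
```

In this toy data the same raw score of 0.85 maps to different recalibrated values depending on the group, which is the behavior a single group-agnostic binning table cannot express.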

Findings

1. QA-calibration improves interpretability of confidence scores.
2. Discretized calibration schemes achieve distribution-free guarantees.
3. Validated on multiple QA benchmarks and large language models.

Abstract

To use generative question-and-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is, *on average*, indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce QA-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized posthoc calibration schemes for achieving QA-calibration. We establish distribution-free guarantees on the performance of this method and validate our method on confidence scores returned…
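
The gap between average-case and group-wise calibration can be seen in a small sketch. The code below computes a binned calibration error either over all examples at once or separately within each question-and-answer group. It is an illustrative simplification, not the paper's formal definition; the function name, the binning scheme, the group labels "A" and "B", and the data are assumptions made for the example.

```python
from collections import defaultdict


def calibration_error(scores, correct, groups=None, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over (group, score-bin) cells.

    With groups=None every example falls in one group, which gives an
    ordinary binned calibration error over the whole dataset.
    """
    if groups is None:
        groups = [0] * len(scores)
    cells = defaultdict(list)
    for s, y, g in zip(scores, correct, groups):
        b = min(int(s * n_bins), n_bins - 1)  # uniform-width score bin
        cells[(g, b)].append((s, y))
    n = len(scores)
    total = 0.0
    for members in cells.values():
        conf = sum(s for s, _ in members) / len(members)
        acc = sum(y for _, y in members) / len(members)
        total += len(members) / n * abs(acc - conf)
    return total


# Hypothetical data: every answer gets confidence 0.5, but group "A" is
# always right and group "B" is always wrong.
scores = [0.5] * 8
correct = [1, 1, 1, 1, 0, 0, 0, 0]
groups = ["A"] * 4 + ["B"] * 4

average_error = calibration_error(scores, correct)          # 0.0
group_error = calibration_error(scores, correct, groups)    # 0.5
```

Here the scores look perfectly calibrated on average, yet a user who only asks group "B" questions sees a confidence of 0.5 on answers that are never correct.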

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The problem of calibration dependency on the dataset is highly significant. The QA pairs in different subsets can exhibit vastly different subset calibration errors, affecting the user's experience. To the best of my knowledge, this is the first paper to identify and directly address this issue. While the proposed methods are relatively straightforward adaptations of existing approaches, they represent an essential first step in tackling this problem. Although initial, this approach is an imp

Weaknesses

1. **Generalizability of $\beta$:** The paper is motivated by the idea that users may have specific interests in different groups of QAs, necessitating calibration that is tailored to user needs. However, the paper demonstrates the method’s effectiveness using only one type of $\beta$ (partitioning based on DistilBERT embeddings), leaving it unclear how well these methods generalize to other partitioning strategies. Showing results with other partitioning methods or justifying this choice’s

Reviewer 02 · Rating 5 · Confidence 4

Strengths

This work defines a generalization of calibration error metrics. The work introduces methods for optimizing for their proposed calibration metric and theoretical results backing up their methods.

Weaknesses

W1. One concern with this work is a possible mischaracterization of how expected calibration error (ECE) is used to evaluate calibration. In the toy example in Table 1, the authors note that the standard calibration error is 0; however, the often-used ECE metric (from [1]) involves binning examples at test time by sorting and partitioning them based on their predicted confidence. In the toy example in Figure 1, ECE (with 2 bins) and $\beta$-calibration are equivalent. While $\beta$-calibration can

Reviewer 03 · Rating 6 · Confidence 5

Strengths

* The paper tackles an important problem (LLM confidence estimation)
* The approach is validated on 5 different QA datasets.
* The “selective” QA metric provides some insight into the practical benefit of improved calibration.

Weaknesses

* The approach assumes access to an “oracle” to measure semantic equivalence between a predicted and reference answer. I’m not sure if Llama-3.1 is such an oracle, unless perhaps it has memorized the QA datasets in question, which raises some other concerns. The approach requires dataset-specific calibration sets. I’m not sure if it’s fair to compare to baselines such as out-of-the-box prompts that don’t use this information. It’s also a bit of a limitation since it presumably makes it difficult

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies