Large Language Model Confidence Estimation via Black-Box Access

Tejaswini Pedapati; Amit Dhurandhar; Soumya Ghosh; Soham Dan; Prasanna Sattigeri

arXiv:2406.04370·cs.CL·July 2, 2025

Open Access · 3 Reviews

TL;DR

This paper presents a black-box framework for estimating confidence in large language model responses using engineered features and logistic regression, demonstrating improved accuracy and cross-model generalization.

Contribution

It introduces a simple, interpretable confidence estimation method for LLMs that works with black-box access and generalizes across models.

Findings

01

Outperforms baseline confidence estimators by over 10% in AUROC in some cases.

02

Effective across multiple LLMs including GPT-4, Llama, and Mistral.

03

Confidence models generalize zero-shot across different models.

Abstract

Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them. We propose a simple and extensible framework in which we engineer novel features and train an interpretable model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating confidence of Flan-ul2, Llama-13b, Mistral-7b and GPT-4 on four benchmark Q&A tasks as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by over $10\%$ (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features…
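The framework described in the abstract (black-box features fed to a logistic regression, evaluated by AUROC) can be illustrated with a minimal sketch. This is not the authors' implementation: the two features here (imagined as agreement scores between a model's responses under repeated sampling and under prompt perturbation) and the synthetic data generator are hypothetical stand-ins, since the abstract does not list the actual features. Only the Python standard library is used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_confidence(w, b, x):
    # Confidence = predicted probability that the response is correct.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def auroc(scores, labels):
    """Probability that a random correct response outscores a random
    incorrect one (ties count as 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic stand-in data (an assumption for illustration only): correct
# responses tend to have higher values on both hypothetical features.
rng = random.Random(0)


def make_example():
    correct = rng.random() < 0.5
    center = 0.75 if correct else 0.35
    feats = [min(1.0, max(0.0, rng.gauss(center, 0.15))) for _ in range(2)]
    return feats, int(correct)


data = [make_example() for _ in range(400)]
X_train, y_train = (list(t) for t in zip(*data[:300]))
X_test, y_test = (list(t) for t in zip(*data[300:]))

w, b = fit_logistic_regression(X_train, y_train)
test_scores = [predict_confidence(w, b, x) for x in X_test]
test_auroc = auroc(test_scores, y_test)
print("weights:", w, "bias:", b, "test AUROC:", test_auroc)
```

Because the model is linear in the features, the learned weights can be read directly as an indication of which features drive the confidence score, which is the kind of interpretability the abstract alludes to.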

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

* This paper studies an important problem: assessing uncertainty in the language model’s output space. It does so for open-ended generation tasks like summarization, where uncertainty quantification is much harder (since you cannot look at the entropy at a specific token).
* The authors present a wide range of experimental analysis in different settings.
* Some of the findings I thought would be useful to the community; for example, I thought it was interesting that repeated sampling from the orig

Weaknesses

* The primary weakness is the use of ROUGE score (a metric based on n-gram overlap with ground-truth summaries) for evaluation. Many of the perturbations, such as mixing up or repeating sentences with entities, help capture how entities are processed into the output (which ROUGE disproportionately captures), rather than actual correctness.
* The procedure the authors propose is very inefficient; it requires many repeated samples from the language model. It also is not clear if this extra compu

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Simple, extensible framework for wrapping black-box confidence scoring techniques into a meta-scorer, presented with several existing and novel confidence scoring techniques.
2. Useful interpretability results.
3. Good communication of approach, drawbacks, and strengths.

Weaknesses

1. In 083, it is claimed that because the models used to produce predictions are simple, the confidence estimates will be well-calibrated. I think this claim needs support; in particular, the paper might make use of ECE (expected calibration error) and/or other calibration metrics to assess the calibration of different approaches including the newly proposed approach.
2. Results are hard to contextualize without error bars.
3. Baselines are mostly focused on previous black-box sampling-based bas

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- Producing confidence estimates for language models is an important problem.
- The idea of aggregating multiple different sources of information across prompts is a natural direction which seems fruitful.
- Studying the transferability across language models is important for having general-purpose calibrators.

Weaknesses

- The regime in which this paper operates could be more realistic. First, the models considered are relatively small. Furthermore, for these models we actually have full white-box access, so evaluating methods that are meant for black-box access doesn't seem as well motivated. It would have been nice to see evaluations on closed API models (e.g., GPT-4, Claude, etc.). The contribution of this paper is fundamentally empirical, so the regime matters here.
- The particular types of perturba

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis