Large Language Model Confidence Estimation via Black-Box Access
Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri

TL;DR
This paper presents a black-box framework for estimating confidence in large language model responses using engineered features and logistic regression, demonstrating improved accuracy and cross-model generalization.
Contribution
It introduces a simple, interpretable confidence estimation method for LLMs that works with black-box access and generalizes across models.
Findings
Outperforms baseline confidence estimators by over 10% in AUROC in some cases.
Effective across multiple LLMs including GPT-4, Llama, and Mistral.
Confidence models generalize zero-shot across different models.
Abstract
Estimating uncertainty or confidence in a model's responses can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them. We propose a simple and extensible framework in which we engineer novel features and train an interpretable model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating the confidence of Flan-ul2, Llama-13b, Mistral-7b and GPT-4 on four benchmark Q&A tasks, as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by over 10% (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features…
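The pipeline the abstract describes (compute features from black-box outputs, fit a logistic regression, score with AUROC) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two features (`agreement`, `stability`), the synthetic data generator, and every function name are hypothetical stand-ins, since the abstract does not list the engineered features.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, lr=0.5, epochs=500):
    # Plain batch gradient descent on the mean log-loss.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_confidence(w, b, x):
    # Estimated probability that the LLM response is correct.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def auroc(scores, labels):
    # Probability that a random correct response outranks a random
    # incorrect one; ties count as one half.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic_data(n, rng):
    # ASSUMPTION: synthetic stand-in for features that would, in practice,
    # be computed from responses obtained through black-box queries.
    X, y = [], []
    for _ in range(n):
        correct = 1 if rng.random() < 0.5 else 0
        agreement = min(1.0, max(0.0, rng.gauss(0.75 if correct else 0.45, 0.15)))
        stability = min(1.0, max(0.0, rng.gauss(0.70 if correct else 0.50, 0.20)))
        X.append([agreement, stability])
        y.append(correct)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic_data(400, rng)
X_test, y_test = make_synthetic_data(200, rng)

w, b = fit_logistic_regression(X_train, y_train)
test_scores = [predict_confidence(w, b, x) for x in X_test]
test_auroc = auroc(test_scores, y_test)
print("weights:", w, "bias:", b)
print("held-out AUROC:", round(test_auroc, 3))
```

Because the model is linear in its inputs, the learned weights can be read directly as feature importances, which is the kind of interpretability the abstract refers to.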
Peer Reviews
Decision: Submitted to ICLR 2025
* This paper studies an important problem: assessing uncertainty in the language model's output space. It does so for open-ended generation tasks like summarization, where uncertainty quantification is much harder (since you cannot look at the entropy at a specific token).
* The authors present a wide range of experimental analyses in different settings.
* Some of the findings I thought would be useful to the community; for example, I thought it was interesting that repeated sampling from the orig
* The primary weakness is the use of ROUGE score (a metric based on n-gram overlap with ground-truth summaries) for evaluation. Many of the perturbations, such as mixing up or repeating sentences with entities, help capture how entities are processed into the output (which ROUGE disproportionately captures), rather than actual correctness.
* The procedure the authors propose is very inefficient; it requires many repeated samples from the language model. It also is not clear if this extra compu
1. Simple, extensible framework for wrapping black-box confidence scoring techniques into a meta-scorer, presented with several existing and novel confidence scoring techniques.
2. Useful interpretability results.
3. Good communication of approach, drawbacks, and strengths.
1. In 083, it is claimed that because the models used to produce predictions are simple, the confidence estimates will be well-calibrated. I think this claim needs support; in particular, the paper might make use of ECE (expected calibration error) and/or other calibration metrics to assess the calibration of different approaches including the newly proposed approach.
2. Results are hard to contextualize without error bars.
3. Baselines are mostly focused on previous black-box sampling-based bas
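The expected calibration error (ECE) suggested in the first point above can be sketched as follows. This is the common equal-width binning estimator, offered as an illustration only; the function name, the bin count of 10, and the toy inputs are assumptions and do not come from the paper or the review.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy inputs (hypothetical): a perfectly calibrated scorer and an overconfident one.
calibrated = expected_calibration_error([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0])
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
print(calibrated, overconfident)
```

A lower value means the predicted confidences track the observed correctness rate more closely; AUROC, by contrast, measures only ranking quality and says nothing about calibration.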
- Producing confidence estimates for language models is an important problem.
- The idea of aggregating multiple different sources of information across prompts is a natural direction which seems fruitful.
- Studying the transferability across language models is important for having general-purpose calibrators.
- The regime in which this paper operates could be more realistic. First, the models considered are relatively small models. Furthermore, for these models we actually have full white-box access, so evaluating methods that are meant for black-box access doesn't seem as well motivated. It would have been nice to see evaluations on closed API models (e.g., GPT-4, Claude, etc.). The contribution of this paper is fundamentally empirical, so the regime matters here.
- The particular types of perturba
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
