Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics

Lorenzo Jaime Yu Flores; Ori Ernst; Jackie Chi Kit Cheung

arXiv:2506.00637·cs.CL·June 16, 2025

TL;DR

This paper introduces task-agnostic confidence metrics, based on characteristics of the model's output distribution, that improve the calibration of text generation models on summarization, translation, and QA.

Contribution

The paper proposes confidence metrics that rely solely on the probabilities of the model outputs, improving calibration without additional fine-tuning or heuristics.

Findings

01. Improved calibration of BART and Flan-T5 models on summarization, translation, and QA datasets.

02. Metrics are task-agnostic and do not require additional training.

03. Enhanced reliability of confidence scores in text generation.

Abstract

Well-calibrated model confidence scores can improve the usefulness of text generation models. For example, users can be prompted to review predictions with low confidence scores, to prevent models from returning bad or potentially dangerous predictions. However, confidence metrics are not always well calibrated in text generation. One reason is that in generation, there can be many valid answers, which previous methods do not always account for. Hence, a confident model could distribute its output probability among multiple sequences because they are all valid. We propose task-agnostic confidence metrics suited to generation, which rely solely on the probabilities associated with the model outputs without the need for further fine-tuning or heuristics. Using these, we are able to improve the calibration of BART and Flan-T5 on summarization, translation, and QA datasets.
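The point about probability mass being shared among several valid sequences can be illustrated with a small sketch. The abstract does not define the proposed metrics, so the code below is not the paper's method: `head_to_tail_ratio` is a hypothetical shape statistic chosen only for illustration, and the two probability lists are made-up numbers standing in for the probabilities of a model's top-5 candidate sequences.

```python
from typing import Sequence


def top1_probability(seq_probs: Sequence[float]) -> float:
    """Baseline confidence: probability of the single most likely sequence."""
    if not seq_probs:
        raise ValueError("need at least one sequence probability")
    return max(seq_probs)


def head_to_tail_ratio(seq_probs: Sequence[float]) -> float:
    """Hypothetical shape-based confidence (an assumption, not the paper's metric):
    ratio of the highest to the lowest probability among the candidates.
    A large ratio means the mass is concentrated in a few sequences."""
    if not seq_probs:
        raise ValueError("need at least one sequence probability")
    ranked = sorted(seq_probs, reverse=True)
    if ranked[-1] <= 0:
        raise ValueError("probabilities must be strictly positive")
    return ranked[0] / ranked[-1]


# Made-up example: three near-equivalent valid outputs share the mass.
several_valid = [0.30, 0.30, 0.30, 0.02, 0.02]
# Made-up example: the mass decays slowly, with no clear set of winners.
unsure = [0.30, 0.20, 0.15, 0.12, 0.10]

for name, probs in [("several_valid", several_valid), ("unsure", unsure)]:
    print(name, top1_probability(probs), head_to_tail_ratio(probs))
```

Both lists have the same top-1 probability, so a confidence score based only on the best sequence cannot tell them apart, while a statistic that looks at the shape of the distribution can. This is only meant to make the motivation concrete; the metrics actually evaluated on BART and Flan-T5 are described in the paper itself.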

Taxonomy

Topics: Advanced Text Analysis Techniques