Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability

Tu Anh Dinh; Jan Niehues

arXiv:2502.11115·cs.CL·September 16, 2025

TL;DR

This paper introduces BoostedProb, a quality estimation method for text-generation models that boosts the model's confidence when several output options are viable, so that confidence better reflects output quality. It outperforms raw model probability and rivals more complex methods.

Contribution

The paper proposes BoostedProb, a simple yet effective confidence boosting technique that improves quality estimation by accounting for multiple correct output options.

Findings

01 BoostedProb improves Pearson correlation with ground-truth quality by +0.194 on average over raw model probability.

02 It outperforms raw model probability across different settings, with no increase in complexity.

03 It rivals or surpasses more costly supervised or ensemble QE methods.
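The findings are stated in terms of Pearson correlation between a QE score and ground-truth quality. As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `pearson` and the sample numbers are hypothetical and are not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance term (unnormalized) and the two standard-deviation terms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical data: QE scores for four outputs and their human quality ratings.
qe_scores = [0.2, 0.5, 0.7, 0.9]
human_quality = [1.0, 2.5, 3.0, 4.5]
r = pearson(qe_scores, human_quality)
```

A value of `r` near 1 means the QE score ranks and scales outputs much as the ground-truth ratings do; the reported +0.194 is an average gain in this quantity.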

Abstract

Quality Estimation (QE) is the task of estimating the quality of model output during inference, when the ground truth is not available. Deriving output quality from the model's output probability is the simplest, lowest-effort approach. However, we show that the output probability of text-generation models can appear underconfident. At each output step, there can be multiple correct options, which spreads the probability distribution out. Thus, lower probability does not necessarily mean lower output quality. Based on this observation, we propose a QE approach called BoostedProb, which boosts the model's confidence in cases where there are multiple viable output options. With no increase in complexity, BoostedProb is notably better than raw model probability in different settings, achieving on average +0.194 improvement in Pearson correlation to ground-truth quality. It also comes close to or…
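The boosting idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the summary here does not say how "multiple viable output options" are detected, so the sketch assumes, purely for illustration, that the viable set is the group of top tokens sitting above the largest drop in the sorted probabilities, and that a chosen token inside that group is credited with the group's total probability. The function name `boosted_confidence` and the example distributions are hypothetical.

```python
def boosted_confidence(probs, chosen):
    """Confidence for the token at index `chosen`, given one step's distribution.

    ASSUMPTION (not from the paper summary): the "viable" tokens are those
    ranked above the largest gap between consecutive sorted probabilities.
    """
    if len(probs) < 2:
        return probs[chosen]
    # Token indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    sorted_p = [probs[i] for i in order]
    # Find the largest drop between neighbours in the sorted list.
    gaps = [sorted_p[k] - sorted_p[k + 1] for k in range(len(sorted_p) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    viable = set(order[: cut + 1])
    if chosen in viable:
        # Several good options share the mass: credit the chosen token with all of it.
        return sum(probs[i] for i in viable)
    # Otherwise fall back to the raw probability.
    return probs[chosen]

# Two near-equal options: raw probability of token 1 is 0.44, boosted is about 0.89.
split = boosted_confidence([0.45, 0.44, 0.06, 0.05], chosen=1)
# A single clear option: boosting leaves the raw probability unchanged.
peaked = boosted_confidence([0.90, 0.05, 0.05], chosen=0)
```

In the first case the raw probability would suggest low confidence even though the model has merely split its mass between two strong candidates, which is the underconfidence effect the abstract describes.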

Taxonomy

Topics: Probability and Statistical Research · Statistical Methods in Clinical Trials · Data Quality and Management