How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering
Zhengbao Jiang, Jun Araki, Haibo Ding, Graham Neubig

TL;DR
This paper investigates how well language models' confidence scores reflect their actual correctness in question answering, and shows that calibration methods such as fine-tuning and post-hoc adjustment make those scores more reliable.
Contribution
It provides a comprehensive analysis of calibration in language models for QA and introduces effective methods for improving their confidence estimates.
Findings
The language models studied (T5, BART, and GPT-2) are poorly calibrated on QA tasks.
Calibration methods improve the correlation between confidence and correctness.
Fine-tuning and post-hoc adjustments enhance model calibration.
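To make "post-hoc adjustment" concrete, here is a minimal sketch of temperature scaling, one widely used post-hoc calibration method. The summary above does not say which adjustment is meant, so treat the choice of technique as an assumption. The function names (`softmax`, `nll`, `fit_temperature`), the candidate grid, and the toy data are hypothetical, and the setup is simplified to scoring a fixed set of candidate answers per question.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over candidate-answer scores after dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(examples, temperature):
    """Mean negative log-likelihood of the gold answer index."""
    losses = [
        -math.log(softmax(logits, temperature)[gold])
        for logits, gold in examples
    ]
    return sum(losses) / len(losses)


def fit_temperature(examples, candidates=None):
    """Pick the temperature that minimizes NLL on held-out examples.

    A plain grid search is used here for clarity (an assumption of this
    sketch); gradient-based fitting is also common.
    """
    if candidates is None:
        candidates = [0.25 * k for k in range(1, 41)]  # 0.25 .. 10.0
    return min(candidates, key=lambda t: nll(examples, t))


# Toy held-out set: (candidate scores, index of the correct answer).
# The model puts about 99% on candidate 0 but is right only 3 times out of 4.
held_out = [([5.0, 0.0], 0)] * 3 + [([5.0, 0.0], 1)]
best_t = fit_temperature(held_out)
calibrated = softmax([5.0, 0.0], best_t)
```

A fitted temperature above 1 flattens the distribution, which is the expected correction for an overconfident model. The ranking of candidates is unchanged, so accuracy stays the same and only the confidence values move.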
Abstract
Recent works have shown that language models (LM) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine this question from the point of view of calibration, the property of a probabilistic model's predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc…
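The notion of calibration in the abstract can be made measurable. The sketch below computes expected calibration error (ECE), a standard metric that bins predictions by confidence and compares each bin's average confidence with its accuracy. Whether the paper reports exactly this metric is not stated in the text above, so treat that as an assumption. The function name, the bin count, and the example values are illustrative only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins.

    confidences: predicted probabilities in [0, 1], one per question.
    correct: booleans, True where the predicted answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width bin; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Overconfident toy model: 90% confidence, 75% accuracy.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
# Model that is fully confident and always right.
well_calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

An ECE of 0 means confidence matches accuracy in every bin. The toy overconfident model scores roughly 0.15, the gap between its 90% confidence and 75% accuracy.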
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Weight Decay · Adam · GPT-2 · BART · Softmax
