How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

Zhengbao Jiang; Jun Araki; Haibo Ding; Graham Neubig

arXiv:2012.00955 · cs.CL · May 21, 2021 · 40 cites

TL;DR

This paper investigates how well language models' confidence scores reflect their actual correctness in question answering, finds them poorly calibrated, and evaluates calibration methods that improve them.

Contribution

It provides a comprehensive analysis of calibration in language models for QA and introduces effective methods for improving their confidence estimates.

Findings

01

Language models are poorly calibrated in QA tasks.

02

Calibration methods improve the correlation between confidence and correctness.

03

Fine-tuning and post-hoc adjustments enhance model calibration.
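
The listing does not say which post-hoc adjustments the paper uses. As an illustration only, here is a minimal sketch of temperature scaling, a widely used post-hoc technique. It assumes each question comes with raw scores (logits) over a fixed set of candidate answers and a held-out set with known correct answers. The function names and the grid of candidate temperatures are hypothetical choices, not taken from the paper or its repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature > 1 flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the correct candidates."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature that minimizes held-out NLL (simple grid search)."""
    if candidates is None:
        # Hypothetical grid: 0.25, 0.30, ..., 5.00
        candidates = [0.25 + 0.05 * i for i in range(96)]
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy held-out set: the model strongly prefers candidate 0 every time,
# but candidate 0 is correct on only 3 of 4 questions (overconfidence).
held_out_logits = [[5.0, 0.0], [5.0, 0.0], [5.0, 0.0], [5.0, 0.0]]
held_out_labels = [0, 0, 0, 1]

best_t = fit_temperature(held_out_logits, held_out_labels)
calibrated = softmax([5.0, 0.0], best_t)
```

On this toy set the fitted temperature is greater than 1, which lowers the top candidate's confidence toward its observed accuracy without changing which candidate ranks first.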

Abstract

Recent works have shown that language models (LM) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine this question from the point of view of calibration, the property of a probabilistic model's predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc…
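
To make the notion of calibration concrete, the sketch below computes expected calibration error (ECE), a common way to measure the gap between predicted confidence and observed accuracy. The abstract excerpt above does not name a metric, so the choice of ECE, the function name, and the default of 10 equal-width bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins.

    confidences: predicted probabilities in [0, 1], one per question.
    correct: booleans, True where the model's answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy example: the model claims 95% confidence but is right 3 times out of 4.
toy_ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95],
                                     [True, True, True, False])
```

A well-calibrated model has an ECE near 0. In the toy example the value is the gap between the stated confidence (0.95) and the observed accuracy (0.75).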

Code & Models

Repositories

jzbjyb/lm-calibration
tf · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Weight Decay · Adam · GPT-2 · BART · Softmax