Re-Examining Calibration: The Case of Question Answering
Chenglei Si, Chen Zhao, Sewon Min, Jordan Boyd-Graber

TL;DR
This paper argues that traditional calibration evaluation can reward uninformative confidence scores, and introduces a new calibration metric, MacroCE, and a new calibration method, ConsCal, for open-domain question answering.
Contribution
The paper proposes MacroCE as a more effective calibration metric and introduces ConsCal, a novel calibration method leveraging prediction consistency across checkpoints.
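To make the contrast between the two kinds of metric concrete, here is a minimal sketch. It assumes MacroCE is the average of two per-class errors: the mean of (1 - confidence) over correct predictions and the mean of confidence over wrong predictions. That is our reading of the paper's definition and should be checked against the paper itself. The ECE shown is the common equal-width-bin variant, which may differ from the bucketing the authors use. Function names and the toy data are illustrative only.

```python
from typing import Sequence


def expected_calibration_error(conf: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: size-weighted mean of |accuracy - avg confidence|."""
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, y in b if y) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece


def macro_ce(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Assumed MacroCE: average the error on correct and on wrong predictions."""
    pos = [1.0 - c for c, y in zip(conf, correct) if y]   # want high confidence
    neg = [c for c, y in zip(conf, correct) if not y]     # want low confidence
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong prediction")
    return 0.5 * (sum(pos) / len(pos) + sum(neg) / len(neg))


# Toy data: two correct and two wrong predictions (accuracy 0.5).
labels = [True, True, False, False]
constant = [0.5, 0.5, 0.5, 0.5]        # same mediocre confidence everywhere
separating = [0.9, 0.9, 0.1, 0.1]      # high on correct, low on wrong

for name, scores in [("constant", constant), ("separating", separating)]:
    print(name,
          round(expected_calibration_error(scores, labels), 3),
          round(macro_ce(scores, labels), 3))
```

On this toy data the constant scores get a lower (better) ECE than the separating scores, even though they carry no information about which predictions are right, while MacroCE ranks the separating scores as clearly better.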
Findings
Traditional calibration methods do not improve MacroCE scores significantly.
MacroCE better captures the quality of confidence estimates in QA models.
ConsCal outperforms existing calibration techniques under the new metric.
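The idea of using prediction consistency across checkpoints can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the function name, the exact-match comparison after simple normalization, the binary 1.0/0.0 output, and the `threshold` parameter are all assumptions made for the example.

```python
from typing import Sequence


def consistency_confidence(checkpoint_answers: Sequence[str],
                           final_answer: str,
                           threshold: int) -> float:
    """Return 1.0 if enough checkpoints agree with the final answer, else 0.0.

    Assumption: agreement is case-insensitive exact match after stripping
    whitespace, and confidence is binary.
    """
    target = final_answer.strip().lower()
    agreeing = sum(1 for a in checkpoint_answers if a.strip().lower() == target)
    return 1.0 if agreeing >= threshold else 0.0


# Answers produced by three hypothetical intermediate checkpoints.
answers = ["Paris", "paris ", "Lyon"]
print(consistency_confidence(answers, "Paris", threshold=2))  # two checkpoints agree
print(consistency_confidence(answers, "Lyon", threshold=2))   # only one agrees
```

A binary score of this kind pushes confidence toward the extremes, which is what a metric that separately penalizes confident errors and unconfident correct answers rewards; in practice the threshold would need to be tuned on held-out data.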
Abstract
For users to trust model predictions, they need to understand model outputs, particularly their confidence. Calibration aims to adjust (calibrate) models' confidence to match expected accuracy. We argue that the traditional calibration evaluation does not promote effective calibrations: for example, it can encourage always assigning a mediocre confidence score to all predictions, which does not help users distinguish correct predictions from wrong ones. Building on those observations, we propose a new calibration metric, MacroCE, that better captures whether the model assigns low confidence to wrong predictions and high confidence to correct predictions. Focusing on the practical application of open-domain question answering, we examine conventional calibration methods applied to the widely-used retriever-reader pipeline, none of which bring significant gains under our new…
Taxonomy
Topics: Topic Modeling · Access Control and Trust · Advanced Graph Neural Networks
