Stop Measuring Calibration When Humans Disagree
Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernández

TL;DR
This paper critiques the practice of measuring classifier calibration against human majority votes in inherently ambiguous tasks, proposing new instance-level metrics that better reflect human judgment uncertainty.
Contribution
The paper shows that measuring calibration against the human majority class is theoretically problematic when annotators inherently disagree, and introduces instance-level calibration measures that capture the variability of human judgements.
Findings
Measuring calibration to human majority is problematic when humans disagree.
The proposed instance-level measures capture three statistical properties of human judgements: class frequency, ranking, and entropy.
Experiments on the ChaosNLI dataset demonstrate empirically that calibration to human majority is problematic.
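The first finding can be illustrated with a toy calculation. The sketch below is not taken from the paper: the vote distributions are made up, and `majority_calibration_gap` is a hypothetical helper that computes a single-bin simplification of expected calibration error (the gap between mean confidence and accuracy against the majority class). It shows that a model reproducing the human vote distribution exactly is still scored as miscalibrated.

```python
import numpy as np

# Hypothetical human vote distributions over three classes (one row per instance).
# These numbers are invented for illustration, not drawn from ChaosNLI.
human = np.array([
    [0.60, 0.30, 0.10],
    [0.50, 0.40, 0.10],
    [0.40, 0.35, 0.25],
])


def majority_calibration_gap(model_probs, human_dists):
    """Single-bin calibration gap: |accuracy vs. human majority - mean confidence|."""
    confidence = model_probs.max(axis=1)
    correct = model_probs.argmax(axis=1) == human_dists.argmax(axis=1)
    return float(abs(correct.mean() - confidence.mean()))


# A model that matches the human distribution on every instance.
gap = majority_calibration_gap(human, human)
print(gap)
```

The model always predicts the majority class, so its accuracy is 1.0, while its mean confidence is only 0.5. The majority-based gap therefore penalizes the model for expressing exactly the uncertainty that the annotators themselves show.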
Abstract
Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - class frequency, ranking and entropy.
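As a rough illustration of what instance-level measures over class frequency, ranking, and entropy could look like, here is a minimal sketch. The function names and the exact formulas (total variation distance, exact ranking match, signed entropy difference) are assumptions chosen for illustration; the abstract does not give the paper's definitions, so these should not be read as the authors' metrics.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())


def class_frequency_error(model, human):
    """Total variation distance between model and human distributions (assumed form)."""
    model, human = np.asarray(model, dtype=float), np.asarray(human, dtype=float)
    return 0.5 * float(np.abs(model - human).sum())


def ranking_agreement(model, human):
    """True if both distributions order the classes identically (ties broken by class index)."""
    model_rank = np.argsort(-np.asarray(model, dtype=float), kind="stable")
    human_rank = np.argsort(-np.asarray(human, dtype=float), kind="stable")
    return bool(np.array_equal(model_rank, human_rank))


def entropy_error(model, human):
    """Signed entropy difference; negative means the model is more certain than the humans."""
    return entropy(model) - entropy(human)


# Made-up example: humans are split, the model is very confident.
human_dist = [0.60, 0.30, 0.10]
model_dist = [0.90, 0.07, 0.03]

freq_err = class_frequency_error(model_dist, human_dist)
same_ranking = ranking_agreement(model_dist, human_dist)
ent_err = entropy_error(model_dist, human_dist)
```

In this example the model ranks the classes in the same order as the annotators, yet its distribution is far from theirs and its entropy is lower, which is the kind of overconfidence a majority-based measure cannot detect on a single instance.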
Taxonomy
Topics: Neural Networks and Applications · Statistical Mechanics and Entropy · Advanced Statistical Methods and Models
