Stop Measuring Calibration When Humans Disagree

Joris Baan; Wilker Aziz; Barbara Plank; Raquel Fernández

arXiv:2210.16133 · cs.CL · December 1, 2022 · 1 citation

TL;DR

This paper critiques the practice of measuring classifier calibration against human majority votes in inherently ambiguous tasks, proposing new instance-level metrics that better reflect human judgment uncertainty.

Contribution

The paper shows that measuring calibration against the human majority class is theoretically problematic when annotators inherently disagree, and it derives instance-level calibration measures that capture the variability of human judgements.

Findings

01

Measuring calibration to human majority is problematic when humans disagree.

02

The proposed instance-level measures capture three statistical properties of human judgements: class frequency, ranking, and entropy.

03

Experiments on the ChaosNLI dataset demonstrate empirically that calibration to the human majority is problematic under inherent disagreement.
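The first finding can be illustrated with a small constructed example. The sketch below is not taken from the paper or its repository: the function name `majority_calibration_gap` and the toy vote distribution are assumptions made for illustration. It compares accuracy against the human majority class with mean model confidence (a single-bin version of expected calibration error) for a model whose predicted distribution matches the human vote distribution exactly.

```python
def majority_calibration_gap(human_dists, model_dists):
    """Absolute gap between accuracy against the human majority class
    and mean confidence (a single-bin calibration error).

    Hypothetical helper for illustration; not the paper's code.
    """
    n = len(human_dists)
    correct = 0
    confidence = 0.0
    for human, model in zip(human_dists, model_dists):
        pred = max(range(len(model)), key=model.__getitem__)
        majority = max(range(len(human)), key=human.__getitem__)
        correct += int(pred == majority)
        confidence += model[pred]
    return abs(correct / n - confidence / n)


# Toy data (assumed): every instance has the same 60/30/10 human vote split,
# and the model reproduces that split exactly.
human = [[0.6, 0.3, 0.1]] * 100
model = [[0.6, 0.3, 0.1]] * 100

gap = majority_calibration_gap(human, model)
print(round(gap, 3))
```

In this toy setting the model always picks the majority class, so accuracy is 1.0 while its confidence is 0.6. The majority-based measure therefore reports a gap of about 0.4 and labels the model underconfident, even though it matches the human judgement distribution on every instance.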

Abstract

Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - class frequency, ranking and entropy.
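One way to make the three properties named in the abstract concrete is to compare a model's predicted distribution with the human judgement distribution for a single instance. The sketch below is an illustration under assumptions, not the paper's exact definitions: the function names are hypothetical, total variation distance stands in for a class-frequency comparison, a simple ordering check stands in for ranking agreement, and a signed Shannon entropy difference stands in for an entropy comparison.

```python
import math


def total_variation(human, model):
    """Class-frequency comparison: total variation distance in [0, 1]."""
    return 0.5 * sum(abs(h - m) for h, m in zip(human, model))


def same_ranking(human, model):
    """Ranking comparison: do both distributions order the classes identically?

    Ties are broken by class index because Python's sort is stable.
    """
    def order(p):
        return sorted(range(len(p)), key=lambda i: -p[i])
    return order(human) == order(model)


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(x * math.log(x) for x in p if x > 0)


def entropy_gap(human, model):
    """Entropy comparison: negative means the model is less uncertain than humans."""
    return entropy(model) - entropy(human)


# Assumed example instance: humans are split, the model is very confident.
human = [0.5, 0.4, 0.1]
model = [0.9, 0.08, 0.02]

tv = total_variation(human, model)
ranked_alike = same_ranking(human, model)
h_gap = entropy_gap(human, model)
print(round(tv, 3), ranked_alike, h_gap < 0)
```

For this assumed instance the model ranks the classes in the same order as the annotators, yet it sits 0.4 away from the human distribution in total variation and has lower entropy than the human votes. The three comparisons can therefore disagree about how well a prediction reflects human judgement, which is why they are computed separately per instance here.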

Code & Models

Repositories

jsbaan/calibration-on-disagreement-data
Official

Taxonomy

Topics: Neural Networks and Applications · Statistical Mechanics and Entropy · Advanced Statistical Methods and Models