GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

Yoo Yeon Sung; Eve Fleisig; Yu Hou; Ishan Upadhyay; Jordan Lee Boyd-Graber

arXiv:2502.19684 · cs.CL · February 28, 2025

TL;DR

GRACE is a benchmark that evaluates language model calibration by comparing model responses to human responses as clues are gradually revealed, enabling granular analysis of calibration errors.

Contribution

We introduce GRACE, a novel benchmark with a new metric, CalScore, for detailed evaluation of language model calibration against human behavior.

Findings

01

Humans are better calibrated than models despite lower accuracy.

02

State-of-the-art models struggle on GRACE, indicating calibration challenges.

03

GRACE effectively measures progress in improving model calibration.

Abstract

Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that…
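The incremental setting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code and it does not implement CalScore, whose definition is not given on this page. Instead it assumes a hypothetical `answer_fn` that returns a guess and a confidence for the clues seen so far, a hypothetical confidence threshold for deciding when to answer, and the standard expected calibration error (ECE) as a stand-in summary of how well confidence tracks accuracy.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class BuzzRecord:
    """One answer attempt (hypothetical record type, not from the paper)."""
    clue_index: int    # 0-based index of the last clue revealed before answering
    confidence: float  # self-reported confidence in [0, 1]
    correct: bool      # whether the guess matched the gold answer


def play_question(
    clues: List[str],
    gold: str,
    answer_fn: Callable[[List[str]], Tuple[str, float]],
    threshold: float = 0.8,  # assumed answering rule, chosen for illustration
) -> Optional[BuzzRecord]:
    """Reveal clues one at a time; answer once confidence reaches the threshold.

    If the threshold is never reached, the answer given on the final clue is used.
    Returns None when there are no clues.
    """
    for i in range(len(clues)):
        guess, conf = answer_fn(clues[: i + 1])
        is_last = i == len(clues) - 1
        if conf >= threshold or is_last:
            correct = guess.strip().lower() == gold.strip().lower()
            return BuzzRecord(clue_index=i, confidence=conf, correct=correct)
    return None


def expected_calibration_error(records: List[BuzzRecord], n_bins: int = 10) -> float:
    """Standard binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins: List[List[BuzzRecord]] = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    total = len(records)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(r.correct for r in b) / len(b)
        mean_conf = sum(r.confidence for r in b) / len(b)
        ece += (len(b) / total) * abs(acc - mean_conf)
    return ece


def toy_model(seen_clues: List[str]) -> Tuple[str, float]:
    """Toy stand-in for a model: confidence grows with the number of clues seen."""
    return "Paris", min(1.0, 0.3 * len(seen_clues))
```

With four clues, `play_question(["c1", "c2", "c3", "c4"], "paris", toy_model)` answers after the third clue, because that is where the toy confidence first passes the assumed threshold. Recording the clue index alongside confidence and correctness is what allows the kind of granular analysis the abstract describes: answering early and confidently but incorrectly is a different failure from answering late and correctly.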

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques