Does Calibration Affect Human Actions?

Meir Nizri; Amos Azaria; Chirag Gupta; Noam Hazon

arXiv:2508.18317·cs.HC·August 27, 2025

3 Reviews

TL;DR

This study investigates how calibrating a machine learning classifier influences the decisions and trust of non-expert humans. It finds that a prospect theory correction from behavioral economics increases the correlation between human decisions and the model's predictions, but does not change the level of trust participants report.

Contribution

The paper introduces prospect theory-based corrections to calibrated scores and, in an HCI experiment, measures their effect on the alignment between human decisions and model predictions.

Findings

01

Prospect theory correction increases decision-model correlation.

02

Calibration alone does not enhance trust.

03

Behavioral economics corrections are crucial for decision alignment.

Abstract

Calibration has been proposed as a way to enhance the reliability and adoption of machine learning classifiers. We study a particular aspect of this proposal: how does calibrating a classification model affect the decisions made by non-expert humans consuming the model's predictions? We perform a Human-Computer-Interaction (HCI) experiment to ascertain the effect of calibration on (i) trust in the model, and (ii) the correlation between decisions and predictions. We also propose further corrections to the reported calibrated scores based on Kahneman and Tversky's prospect theory from behavioral economics, and study the effect of these corrections on trust and decision-making. We find that calibration is not sufficient on its own; the prospect theory correction is crucial for increasing the correlation between human decisions and the model's predictions. While this increased correlation…
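The abstract does not give the form of the correction. As a minimal sketch, assume it inverts the Tversky-Kahneman (1992) probability weighting function w(p) = p^γ / (p^γ + (1 − p)^γ)^(1/γ), with the commonly cited γ = 0.61. Under that assumption, the reported score is chosen so that a reader who distorts probabilities according to w perceives the calibrated value. The function names `tk_weight` and `inverse_tk_weight` and the choice of γ are illustrative, not taken from the paper.

```python
def tk_weight(p: float, gamma: float = 0.61) -> float:
    """Tversky-Kahneman (1992) probability weighting function.

    Maps a stated probability p to the probability a person is assumed
    to perceive. gamma = 0.61 is a commonly cited estimate; it is an
    assumption here, not a value taken from the paper.
    """
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)


def inverse_tk_weight(target: float, gamma: float = 0.61, tol: float = 1e-10) -> float:
    """Find the score to report so that its perceived value equals `target`.

    The weighting function has no closed-form inverse, so this uses
    bisection, relying on tk_weight being increasing on [0, 1] for
    gamma = 0.61.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tk_weight(mid, gamma) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical calibrated scores, used only to illustrate the mapping.
    for calibrated in (0.05, 0.30, 0.70, 0.95):
        reported = inverse_tk_weight(calibrated)
        print(f"calibrated={calibrated:.2f} reported={reported:.3f} "
              f"perceived={tk_weight(reported):.3f}")
```

Under this weighting function people overweight small probabilities and underweight large ones, so the sketch reports a score below a low calibrated probability and above a high one.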

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

- The proposed idea of using prospect theory on top of calibration to help align human perception with the model's predictions is nice and seems novel.
- The experimental results support the paper's claim that using prospect theory together with calibration increases the correlation of individuals' decisions with the model's predictions.

Weaknesses

- The methodological contribution itself is relatively small; the application of prospect theory to the problem is quite straightforward.
- The study setting is somewhat limited in that the participants have to make decisions based on the model's predictions only and have no other information available. This doesn't seem realistic in most assisted decision-making scenarios, where the individual could ignore the model if they do not trust it and base the decision on their own knowledge

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

1. The paper asks a relevant question. Traditional calibration is usually considered the de facto measure of reliability in the popular machine learning literature. However, machine learning prediction systems are not built in isolation, and how they affect human decision systems has major implications. Thus, studying the usability of calibration for actual human subjects is an insightful research question.
2. The introduction of prospect theory based post-hoc correction is also interesting to mak

Weaknesses

1. One of the crucial limitations of the paper is the lack of a thorough description of the human study conducted. The paper claims that "there is no reported difference in the level of trust reported by the participants". However, without further information on the nature of the instructions / guidelines provided to the human subjects, it could very well be the case that the subjects of this study behaved randomly (which is not an uncommon phenomenon, and is usually controlled for in user studies by designin

Reviewer 03 · Rating 3 · reject, not good enough · Confidence 3

Strengths

- The paper examines a very important problem: the link between confidence calibration and how humans make judgments using these confidence scores
- The paper shows how a reweighting function (with ideas from decision theory) that can reweight confidences elicits more trust from humans than a simple calibrated model
- The paper's ideas and results are crucial to creating trustable ML systems and would be very interesting to these communities

Weaknesses

- I think the paper's experimentation is lacking.
- The current experimental setup is much too simplistic:
  1. Just asking the users how much they trust the system can result in a lot of noise, especially as users have no reason to be faithful. It seems that prior works usually measure some proxy for trust [1], or simulate an environment where participants' trust is linked to some monetary risk/reward [2, 3]
- The authors show experiments on a single task, also the authors ignore the
