Calibration tests beyond classification
David Widmann, Fredrik Lindsten, Dave Zachariah

TL;DR
This paper introduces a unified framework for evaluating calibration in probabilistic models across classification and regression tasks, generalizing existing measures and tests to improve interpretability and applicability.
Contribution
The paper proposes the first framework that unifies calibration evaluation and testing for general probabilistic predictive models, including regression and multi-class classification models.
Findings
Reformulates and generalizes the kernel calibration error, its estimators, and its hypothesis tests using scalar-valued kernels
Applies calibration evaluation to real-valued regression problems
Provides a more intuitive reformulation of a recently proposed calibration framework for multi-class classification
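To make the kernel calibration error mentioned above concrete, here is a minimal sketch for the classification case. It assumes a product of two scalar-valued kernels (a Gaussian kernel on predicted probability vectors times a Kronecker delta on class labels) and a pairwise U-statistic estimator. The function name `skce_unbiased`, the `bandwidth` parameter, and the kernel choice are illustrative assumptions, not taken from the paper's code.

```python
import math


def skce_unbiased(preds, targets, bandwidth=1.0):
    """Pairwise (U-statistic) estimate of a squared kernel calibration error.

    preds:   list of predicted probability vectors, one per sample
    targets: list of observed class indices, one per sample
    Assumed kernel: Gaussian on predictions times Kronecker delta on labels.
    """
    n = len(preds)
    if n < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p, q = preds[i], preds[j]
            yi, yj = targets[i], targets[j]
            # Scalar-valued kernel on the pair of predictions.
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            k_pred = math.exp(-sq_dist / (2.0 * bandwidth ** 2))
            # Label kernel minus its expectations under the predictions,
            # which equals the inner product (e_yi - p) . (e_yj - q).
            residual = (
                (1.0 if yi == yj else 0.0)
                - p[yj]
                - q[yi]
                + sum(a * b for a, b in zip(p, q))
            )
            total += k_pred * residual
    # Average over all unordered pairs.
    return total / (n * (n - 1) / 2)
```

Because the estimate averages over pairs of distinct samples, it can come out slightly negative on finite data. Values near zero are consistent with calibration, while clearly positive values suggest miscalibration.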
Abstract
Most supervised machine learning tasks are subject to irreducible prediction errors. Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets, rather than point estimates. Such models can be a valuable tool in decision-making under uncertainty, provided that the model output is meaningful and interpretable. Calibrated models guarantee that the probabilistic predictions are neither over- nor under-confident. In the machine learning literature, different measures and statistical tests have been proposed and studied for evaluating the calibration of classification models. For regression problems, however, research has been focused on a weaker condition of calibration based on predicted quantiles for real-valued targets. In this paper, we propose the first framework that unifies calibration evaluation and…
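The weaker, quantile-based notion of calibration for regression mentioned in the abstract can be illustrated with a short sketch. It assumes Gaussian predictive distributions and checks how often the target falls below each predicted quantile. The names `normal_cdf` and `quantile_coverage` and the default `levels` are hypothetical choices for illustration, and this is a generic check rather than the framework proposed in the paper.

```python
import math


def normal_cdf(y, mean, std):
    """CDF of a normal distribution N(mean, std^2) evaluated at y."""
    return 0.5 * (1.0 + math.erf((y - mean) / (std * math.sqrt(2.0))))


def quantile_coverage(means, stds, targets, levels=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Empirical frequency with which targets fall below each predicted quantile.

    For a quantile-calibrated model, the frequency at level tau should be
    close to tau.
    """
    # Probability integral transform: predicted CDF evaluated at the target.
    pit = [normal_cdf(y, m, s) for y, m, s in zip(targets, means, stds)]
    return {tau: sum(u <= tau for u in pit) / len(pit) for tau in levels}
```

A model can pass this check and still be miscalibrated in the stronger sense, since the check only looks at the marginal behavior of predicted quantiles and not at the full predictive distribution.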
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Statistical Methods and Models · Statistical Methods and Inference
