Measuring Uncertainty Calibration
Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian, Juan Elenter Litwin, Francesco Tonolini, David Gustafsson, Eva Garcia-Martin, Carmen Barcena Gonzalez, Raphaëlle Bertrand-Lalo

TL;DR
This paper introduces methods for estimating the calibration error of binary classifiers from finite data: non-asymptotic, distribution-free upper bounds, and a way to modify a classifier so that its calibration error can be bounded efficiently without significantly hurting performance.
Contribution
It gives an upper bound on the calibration error of any classifier whose calibration function has bounded variation, and a method for modifying any classifier so that its calibration error can be upper bounded efficiently, without restrictive assumptions.
Findings
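Provides a non-asymptotic upper bound on calibration error for classifiers whose calibration function has bounded variation.
Develops a practical method for modifying a classifier so that its calibration error can be upper bounded efficiently, without significantly impacting performance.
Results are distribution-free, and the procedures run on real-world datasets with modest overhead.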
Abstract
We make two contributions to the problem of estimating the calibration error of a binary classifier from a finite dataset. First, we provide an upper bound for any classifier where the calibration function has bounded variation. Second, we provide a method of modifying any classifier so that its calibration error can be upper bounded efficiently without significantly impacting classifier performance and without any restrictive assumptions. All our results are non-asymptotic and distribution-free. We conclude by providing advice on how to measure calibration error in practice. Our methods yield practical procedures that can be run on real-world datasets with modest overhead.
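For context on what "estimating calibration error from a finite dataset" involves, here is a minimal sketch of the common binned plug-in estimator. This is not the method of the paper: it produces a point estimate, not an upper bound, and it assumes the usual $L_1$ definition of calibration error, $\mathbb{E}\,|\eta(f(X)) - f(X)|$ with $\eta(p) = P(Y = 1 \mid f(X) = p)$. The function name `binned_calibration_error` and the parameter `n_bins` are illustrative choices, not taken from the source.

```python
import numpy as np


def binned_calibration_error(probs, labels, n_bins=10):
    """Binned plug-in estimate of the L1 calibration error.

    Illustrative baseline only; not the estimator proposed in the paper.
    probs  : predicted probabilities of the positive class, in [0, 1]
    labels : binary labels (0 or 1)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Assign each prediction to one of n_bins equal-width bins on [0, 1];
    # a prediction of exactly 1.0 is placed in the last bin.
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)

    error = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between the observed positive rate and the mean prediction.
        gap = abs(labels[mask].mean() - probs[mask].mean())
        # Weight the gap by the fraction of samples falling in this bin.
        error += mask.mean() * gap
    return error


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.uniform(size=5000)
    y_calibrated = rng.binomial(1, p)        # labels drawn from the predictions
    y_miscalibrated = rng.binomial(1, p**2)  # true positive rate is p^2, not p
    print(binned_calibration_error(p, y_calibrated))
    print(binned_calibration_error(p, y_miscalibrated))
```

The estimate depends on the choice of `n_bins`, which is one reason bounds that hold without tuning assumptions, such as those the abstract describes, are of practical interest.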
Peer Reviews
Decision: ICLR 2026 Poster
* The paper introduced novel ways to calculate bounds for calibration error. The authors explained these bounds in detail and provided proofs. The explanations in the main part of the paper were very well written, providing most of the necessary intuitions and insights to be able to follow the paper.
* The related work is well-connected to this work.
* The experiments complement the paper and showcase that their method is giving the best results.
* The paper would have benefited from illustrations of the difference between $\eta$ and $\hat{\eta}$ for TV denoising and for kernel smoothing. Such illustrations could be done for some moderately non-monotonic function, e.g. with total variation above 1 but below 2. While the descriptions were all eventually understandable to me, I think most readers would benefit from such illustrative figures. There was some space remaining for this (if I understand correctly that LLM usage, reproducibility
The proposed methodologies are generally sound. To my knowledge, the derived estimators are new contributions to the binary classification calibration domain. The paper is well written, and the new calibration error upper bounds are motivated by the limitations of previous work. The proposed theoretical results are supported by experimental results on both synthetic and real data. Although targeted at a very specific classification setup, the contribution could provide useful information for the com
There are also some weaknesses affecting the clarity of the presentation and significance of the results. First, although the motivation and derivation of the proposed theoretical results seem ok, the presentation could be improved and supported by illustration of the problem at the beginning, including the problem setup and possible limitations of previous work. Second, the experimental evaluation could be more versatile to fully support the proposed techniques, including the comparison with se
- The proposed methods are non-asymptotic and distribution-free.
- The empirical validation includes both synthetic experiments with known ground truth and real-data experiments. Real-data experiments (Amazon Polarity, CIFAR, IMDB, Spam) demonstrate practical applicability.
**Missing related work** The paper would benefit from a more complete discussion of recent literature, e.g. https://proceedings.neurips.cc/paper_files/paper/2020/file/26d88423fc6da243ffddf161ca712757-Paper.pdf also addresses distribution-free calibration in binary classification, establishing fundamental limits and impossibility results in the absence of distributional assumptions.
**Insufficient experimental analysis** The experimental section is almost entirely descriptive rather than anal
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Machine Learning and Data Classification · Anomaly Detection Techniques and Applications
