Smooth ECE: Principled Reliability Diagrams via Kernel Smoothing
Jarosław Błasiok, Preetum Nakkiran

TL;DR
This paper introduces SmoothECE, a kernel smoothing-based calibration measure and reliability diagram that addresses flaws in traditional binning methods, providing a consistent and visually interpretable calibration assessment.
Contribution
The paper proposes a novel kernel smoothing approach for calibration measures and reliability diagrams, ensuring consistency and improved interpretability over traditional binning methods.
Findings
SmoothECE is a well-behaved, consistent calibration measure.
The smoothed reliability diagram visually encodes the SmoothECE.
The provided Python package simplifies calibration measurement and visualization.
Abstract
Calibration measures and reliability diagrams are two fundamental tools for measuring and interpreting the calibration of probabilistic predictors. Calibration measures quantify the degree of miscalibration, and reliability diagrams visualize the structure of this miscalibration. However, the most common constructions of reliability diagrams and calibration measures -- binning and ECE -- both suffer from well-known flaws (e.g. discontinuity). We show that a simple modification fixes both constructions: first smooth the observations using an RBF kernel, then compute the Expected Calibration Error (ECE) of this smoothed function. We prove that with a careful choice of bandwidth, this method yields a calibration measure that is well-behaved in the sense of (Błasiok, Gopalan, Hu, and Nakkiran 2023a) -- a consistent calibration measure. We call this measure the SmoothECE. Moreover, the…
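The two constructions contrasted in the abstract can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's exact definition and not the API of the provided Python package: the function names `binned_ece` and `smoothed_calibration_error` are hypothetical, the bandwidth `sigma` is fixed by hand rather than chosen by the paper's procedure, and a plain Gaussian kernel is used instead of the reflected one mentioned in the reviews. The example shows how an arbitrarily small shift in the predictions can move a binned ECE from about 0.5 to 0, while the kernel-smoothed quantity barely changes.

```python
import numpy as np


def binned_ece(f, y, n_bins=2):
    """Equal-width binned ECE: bin-weighted |mean(y) - mean(f)| per bin."""
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    # Map each prediction to a bin index; f == 1.0 goes into the last bin.
    idx = np.minimum((f * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            total += mask.mean() * abs(y[mask].mean() - f[mask].mean())
    return float(total)


def smoothed_calibration_error(f, y, sigma=0.05, grid_size=1001):
    """Kernel-smoothed calibration error with a fixed bandwidth (illustrative).

    Assumptions: plain Gaussian kernel, hand-picked sigma, grid on [0, 1].
    """
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    t = np.linspace(0.0, 1.0, grid_size)
    # Gaussian (RBF) kernel between every grid point and every prediction.
    z = (t[:, None] - f[None, :]) / sigma
    k = np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))
    # Smoothed residual at each grid point: (1/n) * sum_i K(t, f_i) * (y_i - f_i).
    smoothed_residual = k @ (y - f) / len(f)
    # Integrate the absolute smoothed residual over [0, 1] (trapezoid rule).
    abs_r = np.abs(smoothed_residual)
    dt = t[1] - t[0]
    return float(np.sum(0.5 * (abs_r[1:] + abs_r[:-1])) * dt)


# Two predictors that differ by a tiny shift around the bin boundary 0.5.
eps = 1e-6
y = np.array([0.0] * 50 + [1.0] * 50)
f_split = np.array([0.5 - eps] * 50 + [0.5 + eps] * 50)
f_const = np.full(100, 0.5)

print("binned ECE:  ", binned_ece(f_split, y), binned_ece(f_const, y))
print("smoothed ECE:", smoothed_calibration_error(f_split, y),
      smoothed_calibration_error(f_const, y))
```

Smoothing the residuals `y - f` before taking absolute values is what removes the dependence on where bin edges fall; the choice of `sigma` still matters, which is why the abstract stresses a careful choice of bandwidth.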
Peer Reviews
Decision·ICLR 2024 poster
The paper is well-written and easy to follow. It provides a nice alternative to the commonly used binned ECE estimator, which has the potential to be widely used in the model calibration literature. The theoretical property of the proposed estimator is carefully studied and the computational complexity is also addressed.
1. The major contribution of the paper is the new smooth ECE estimator over the commonly used Binned ECE. The paper discussed several disadvantages or flaws of the binned ECE in the introduction, but I think these flaws are not well demonstrated in the experiments, e.g. "changing the predictor by an infinitesimally small amount may change its ECE drastically", "overly sensitive to the choice of bin widths." I think it is beneficial to include some synthetic experiments to demonstrate these probl
**originality** The proposed SmoothECE is novel, and so are the theoretical results. **quality** The proposed method is sound. Its consistency is proved as a result of the combination of the use of reflected Gaussian kernels and the way to set $\sigma$, which is very neat. **clarity** The paper is well-written and easy to follow. **significance** SmoothECE is a drop-in replacement of BinnedECE and can be potentially widely used by the community. Apart from this, as SmoothECE also alleviates the di
The experiment section is weak. If the proposed method alleviates the discontinuity problem of BinnedECE (to be confirmed in Questions), some experiments showing how it can be beneficial (e.g. optimizing a loss involving the calibration metric) should be included. The code is not provided. Perhaps the author(s) can use https://anonymous.4open.science to share it anonymously.
While a number of miscalibration measures that use kernels have been proposed, they are not "human-interpretable" like an l1-ECE is. In particular, they cannot be plotted on a reliability diagram. This paper thus fills an important gap. The proposed reliability diagram is also principled in that it leads to a consistent calibration measure, in the sense of Błasiok et al. (2023). The paper is easy-to-follow, clearly describes the issue that is being targeted, and the proposed solution is well-c
In my opinion, the quality of the paper can be improved with more detailed research and better presentation. I have a number of questions, and I feel at least some of them should be answered before publication. (Thus I have given a contribution rating of 2 since I believe more evidence needs to be provided to show the proposed method indeed solves the problem satisfactorily.) ## Theory/method questions: - Page 3 bottom (initial proposal for smECE): I feel that the actual smECE defined in eq.
Taxonomy
Topics: Scientific Measurement and Uncertainty Evaluation · Advanced Statistical Process Monitoring · Advanced Statistical Methods and Models
Methods: Radial Basis Function
