Optimizing Calibration by Gaining Aware of Prediction Correctness
Yuchi Liu, Lei Wang, Yuli Zou, James Zou, Liang Zheng

TL;DR
This paper introduces a novel post-hoc calibration method that improves confidence alignment with prediction correctness by leveraging transformed sample versions, addressing limitations of traditional CE loss-based calibration.
Contribution
It proposes a new calibration objective that better aligns confidence with correctness and utilizes transformed samples during training for enhanced calibration performance.
Findings
Achieves competitive calibration on in-distribution and out-of-distribution data.
Addresses limitations of CE loss in calibration tasks.
Provides analysis distinguishing the new method from traditional objectives.
Abstract
Model calibration aims to align confidence with prediction correctness. The Cross-Entropy (CE) loss is widely used for calibrator training; it forces the model to increase its confidence on the ground-truth class. However, we find that the CE loss has intrinsic limitations. For example, for a narrow misclassification (e.g., a test sample is wrongly classified and its softmax score on the ground-truth class is 0.4), a calibrator trained with the CE loss often produces high confidence on the wrongly predicted class, which is undesirable. In this paper, we propose a new post-hoc calibration objective derived from the aim of calibration. Intuitively, the proposed objective function asks that the calibrator decrease model confidence on wrongly predicted samples and increase confidence on correctly predicted samples. Because a sample itself has insufficient ability to indicate correctness, we use…
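The narrow-misclassification example in the abstract can be made concrete with a small sketch. It assumes a single toy sample with softmax scores (0.5, 0.4, 0.1) whose ground-truth class is the one scored 0.4, a single global temperature searched over a grid, and a squared-error stand-in for a correctness-aware objective. The function `correctness_loss` and the grid are illustrative assumptions, not the paper's exact CA loss or training procedure.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(logits, label, temperature):
    # Cross-entropy: negative log-probability of the ground-truth class.
    return -math.log(softmax(logits, temperature)[label])

def correctness_loss(logits, label, temperature):
    # Hypothetical stand-in (an assumption, not the paper's loss):
    # squared gap between the max confidence and the 0/1 correctness.
    probs = softmax(logits, temperature)
    pred = max(range(len(probs)), key=probs.__getitem__)
    correct = 1.0 if pred == label else 0.0
    return (probs[pred] - correct) ** 2

# Toy narrow misclassification: predicted class 0, ground truth class 1.
logits = [math.log(0.5), math.log(0.4), math.log(0.1)]
label = 1

# Grid of candidate temperatures from 0.5 to 4.0.
temperatures = [0.5 + 0.05 * k for k in range(71)]
best_t_ce = min(temperatures, key=lambda t: ce_loss(logits, label, t))
best_t_ca = min(temperatures, key=lambda t: correctness_loss(logits, label, t))

# Confidence on the (wrong) predicted class under each chosen temperature.
conf_ce = max(softmax(logits, best_t_ce))
conf_ca = max(softmax(logits, best_t_ca))
print(best_t_ce, conf_ce, best_t_ca, conf_ca)
```

On this toy sample the CE-optimal temperature falls below 1, which sharpens the distribution and slightly raises confidence on the wrong class, while the stand-in objective prefers a larger temperature that lowers it. The effect on a single sample is small; the sketch only illustrates the direction of each objective.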
Peer Reviews
Decision: Submitted to ICLR 2025
This paper presents a range of validation scenarios to assess the effectiveness of the proposed framework. In numerous cases, the framework achieves state-of-the-art performance, validating the impact of its two novel schemes. The experimental setup and comparisons are thoughtfully designed, with detailed descriptions that enhance clarity and reproducibility. Mathematical derivations are presented comprehensively, and the overall narrative is organized in a way that makes the framework easy to f
The paper has several strengths, yet I have some specific concerns that warrant attention: 1. Definition of "Narrow Misclassification": The term "narrow misclassification" appears in the abstract, and the correctness-aware (CA) loss is presented as targeting this condition by adjusting predictions across different classes rather than solely reducing confidence in the incorrect class. However, a clear definition of "narrow misclassification" is missing, and it’s challenging to discern how it
- The idea of using test-time augmentation to predict a sample-based temperature scaling factor, and of learning a network to predict that temperature, is novel, as far as I know.
- The justification of the loss on a toy example, pointing out its behavior on so-called narrowly wrong samples, is intuitive.
- Rather extensive experiments on several types of image datasets show the benefit of the approach over standard calibration methods and other optimization losses.
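The sample-wise temperature idea described in this review can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the feature (max softmax probability of each transformed view), the linear scorer, and the values of `weights` and `bias` are all hypothetical placeholders for what a learned calibrator network would provide.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predict_temperature(transformed_logits, weights, bias):
    # Assumed feature: max softmax probability of each transformed view.
    features = [max(softmax(z)) for z in transformed_logits]
    score = sum(w * f for w, f in zip(weights, features)) + bias
    # Softplus keeps the predicted temperature strictly positive.
    return math.log1p(math.exp(score)) + 1e-3

def calibrate(original_logits, transformed_logits, weights, bias):
    # Rescale the original logits with the per-sample temperature.
    t = predict_temperature(transformed_logits, weights, bias)
    return softmax(original_logits, t), t

original = [2.0, 1.5, 0.1]
# Views whose predictions stay confident under transformation.
stable_views = [[2.1, 1.0, 0.0], [1.9, 0.9, 0.2], [2.2, 1.1, 0.1]]
# Views whose predictions become uncertain under transformation.
unstable_views = [[1.0, 1.2, 0.9], [0.8, 0.7, 1.0], [1.1, 1.0, 1.2]]

weights = [-2.0, -2.0, -2.0]  # hypothetical; would be learned in practice
bias = 4.0                    # hypothetical; would be learned in practice

probs_stable, t_stable = calibrate(original, stable_views, weights, bias)
probs_unstable, t_unstable = calibrate(original, unstable_views, weights, bias)
print(t_stable, max(probs_stable))
print(t_unstable, max(probs_unstable))
```

With these placeholder parameters, a sample whose transformed views look uncertain receives a higher temperature and therefore lower confidence, while the predicted class itself is unchanged because temperature scaling preserves the ordering of the logits.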
- The goal of the formal development (Section 3.2) is not clear: what is it supposed to show? Is it to prove that the empirical criterion (7) is a good proxy for optimizing (3), given that $\hat{c}$ is produced by the calibration pipeline of Figure 2? If so, I am not convinced that the formal developments of Section 3.2 actually prove this.
- The writing lacks precision (see my first question, same symbol $E_f^{emp}$ but different concepts for instance).
- The data augmentation is justified b
**Novel Calibration Objective:** The paper introduces a new loss function, CA loss, which is a significant contribution to the field of model calibration. This loss function is intuitively designed to align with the goal of calibration, which is to ensure high confidence for correct predictions and low confidence for incorrect ones. **Empirical Evidence:** The authors provide extensive experimental results demonstrating the effectiveness of their proposed method across various datasets, includi
**Dependency on Transformations:** The effectiveness of the CA loss relies on the use of transformed images to infer correctness. If these transformations do not adequately capture the characteristics of correct and incorrect predictions, the calibration might be less effective. **Transformations lack theoretical grounding:** The use of transformations such as rotation, grayscale, color jittering, and others has proven to be effective in practice; however, the choice of transformations and their n
Taxonomy
Topics: Neural Networks and Applications
Methods: Sparse Evolutionary Training · Softmax · ALIGN
