Sample Margin-Aware Recalibration of Temperature Scaling

Haolan Guo; Linwei Tao; Haoyang Luo; Minjing Dong; Chang Xu

arXiv:2506.23492·cs.LG·July 1, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces SMART, a lightweight, data-efficient calibration method that uses the logit gap for robust, sample-aware temperature scaling, significantly improving neural network calibration especially with limited data.

Contribution

SMART is a novel, sample margin-aware recalibration technique that leverages the logit gap and a soft-binned SoftECE objective for robust, efficient calibration.
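To make the idea of a soft-binned calibration objective concrete, here is a minimal sketch of a soft-binned ECE in pure Python. It is an illustration only, not the paper's Huber–SoftECE objective: the function name `soft_binned_ece`, the bin count, the `bandwidth` parameter, and the squared-distance membership rule are all assumptions made for this example. Each sample is spread across every bin with softmax weights instead of being assigned to one bin, which is what makes the estimate smooth in the confidences.

```python
import math

def soft_binned_ece(confidences, correct, n_bins=10, bandwidth=0.01):
    """Soft-binned expected calibration error (illustrative sketch).

    confidences: predicted top-class probabilities in [0, 1]
    correct:     1 if the prediction was right, else 0
    bandwidth:   hypothetical smoothing parameter; smaller means harder bins
    """
    centers = [(j + 0.5) / n_bins for j in range(n_bins)]
    bin_weight = [0.0] * n_bins  # soft count of samples per bin
    bin_conf = [0.0] * n_bins    # membership-weighted confidence sum
    bin_acc = [0.0] * n_bins     # membership-weighted accuracy sum

    for c, y in zip(confidences, correct):
        # Softmax over negative squared distances to the bin centers.
        scores = [-((c - mu) ** 2) / bandwidth for mu in centers]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        for j, e in enumerate(exps):
            u = e / total
            bin_weight[j] += u
            bin_conf[j] += u * c
            bin_acc[j] += u * y

    n = len(confidences)
    ece = 0.0
    for w, sum_conf, sum_acc in zip(bin_weight, bin_conf, bin_acc):
        if w > 1e-12:  # skip bins that received essentially no mass
            ece += (w / n) * abs(sum_acc / w - sum_conf / w)
    return ece

# Overconfident toy model: 90% confidence, 50% accuracy.
overconfident = soft_binned_ece([0.9] * 10, [1, 0] * 5)
# Calibrated toy model: 50% confidence, 50% accuracy.
calibrated = soft_binned_ece([0.5] * 4, [1, 0, 1, 0])
```

In the first toy case every sample has the same confidence, so each bin sees the same accuracy-confidence difference of 0.4 and the result is about 0.4; in the second case the difference vanishes.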

Findings

01 Achieves state-of-the-art calibration with fewer parameters.

02 Performs well across diverse datasets and architectures.

03 Effective even with limited calibration data.

Abstract

Recent advances in deep learning have significantly improved predictive accuracy. However, modern neural networks remain systematically overconfident, posing risks for deployment in safety-critical scenarios. Current post-hoc calibration methods face a fundamental dilemma: global approaches like Temperature Scaling apply uniform adjustments across all samples, introducing high bias despite computational efficiency, while more expressive methods that operate on full logit distributions suffer from high variance due to noisy high-dimensional inputs and insufficient validation data. To address these challenges, we propose Sample Margin-Aware Recalibration of Temperature (SMART), a lightweight, data-efficient recalibration method that precisely scales logits based on the margin between the top two logits -- termed the logit gap. Specifically, the logit gap serves as a denoised, scalar…
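The core mechanism described in the abstract, scaling logits with a temperature chosen per sample from the logit gap, can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' implementation: `gap_to_temperature` is a hypothetical affine-plus-softplus stand-in for whatever learned mapping SMART uses, and the values of `w` and `b` are placeholders rather than fitted parameters.

```python
import math

def logit_gap(logits):
    """Margin between the largest and second-largest logit."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2

def gap_to_temperature(gap, w=0.5, b=0.0):
    """Hypothetical stand-in for a learned gap-to-temperature map.

    A numerically stable softplus keeps the temperature strictly positive.
    w and b are placeholder values, not fitted parameters.
    """
    x = w * gap + b
    return max(x, 0.0) + math.log1p(math.exp(-abs(x))) + 1e-3

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max subtraction for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def recalibrate(logits):
    """Scale one sample's logits by a temperature derived from its gap."""
    temperature = gap_to_temperature(logit_gap(logits))
    return softmax(logits, temperature), temperature

logits = [2.0, 5.0, 1.0]
probs, temperature = recalibrate(logits)
```

Because every logit of a sample is divided by the same positive temperature, the predicted class is unchanged; only the confidence assigned to it moves.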

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper provides a rigorous and original theoretical explanation for why NLL optimization fails to align with calibration objectives (Proposition 3.1, Lemma 2).
2. The derivation in Appendix A.1 nicely shows that the feasible temperature range is linearly correlated with the logit margin, partially justifying the use of the margin as a key indicator of the optimal temperature.
3. The proposed Huber–SoftECE objective inherits the differentiability of SoftECE while offering greater training stab

Weaknesses

1. The paper combines two somewhat known ideas: logit margin as a hardness signal and a soft-binned calibration loss. The theoretical insight about NLL–ECE misalignment is new, but the intuition is widely known in this community. The methodological innovation may be perceived as an incremental refinement rather than a paradigm shift.
2. Limited intuition for Huber–SoftECE behavior. While the theorem provides an upper-bound guarantee, the paper does not include ablation or visualization showing how

Reviewer 02 · Rating 4 · Confidence 5

Strengths

- The paper's motivation is clear and addresses important limitations of existing methods. The identification of the logit margin as a lightweight input for sample-wise calibration is interesting and supported by both theoretical arguments (Prop. 3.4) and empirical analysis.
- The proposed method is lightweight and demonstrates good empirical performance on the reported datasets (CIFAR-10/100, ImageNet) and their variants, consistently outperforming a wide range of baselines.

Weaknesses

My main concerns are regarding missing baselines, the theoretical justification for the proposed objective, and the limited scope of the experimental evaluation.
- Missing baselines and modern models: A highly relevant baseline is missing from the evaluation: Density Aware Calibration (Tomani et al., ICML 2023) is a recent, sample-adaptive method that also aims to provide robust calibration, particularly under distribution shift. Given that the authors make strong claims on robustness to such sh

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. Theoretically Grounded Objective Function: Huber–SoftECE addresses a longstanding flaw in NLL-based calibration by directly targeting calibration error. The theoretical guarantee that it upper-bounds smCE ensures alignment between optimization and calibration goals, a rare strength in post-hoc methods that often lack such rigor.
2. Strong Practical Utility: SMART balances performance and efficiency: its lightweight MLP design avoids the computational burden of existing methods, making it fe

Weaknesses

The paper acknowledges that SMART may degrade in zero-shot scenarios but provides no further details—e.g., whether it can leverage cross-domain margin signals, or if pre-trained margin-temperature mappings transfer to new domains. This is a critical gap for safety-critical applications where validation data may be scarce.

Reviewer 04 · Rating 4 · Confidence 4

Strengths

The empirical results show that the proposed approach outperforms alternative methods in various settings.

Weaknesses

* The novelty should be stated more clearly. There exist works, such as [Wei et al. 2022], that identified a relation between dominant logit values and model calibration; works, such as [Karandikar et al. 2021], that proposed and studied soft and differentiable versions of ECE; and works, such as [Tomani et al., 2022], where the temperature is parameterized and takes as input the sorted logit vector (which includes the margin between the first and second largest entries).

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)