A Benchmark Study on Calibration
Linwei Tao, Younan Zhu, Haolan Guo, Minjing Dong, Chang Xu

TL;DR
This study analyzes neural network calibration at scale, using a dataset of 117,702 models from the NATS-Bench NAS search space to address key questions about calibration generalization, metrics, and architectural influences.
Contribution
It introduces a novel calibration dataset for NAS models and provides the first large-scale investigation into calibration properties and their interaction with architecture design.
Findings
Calibration can be generalized across datasets.
Robustness can serve as a calibration measurement.
Calibration metrics vary in reliability.
Abstract
Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through the use of specific loss functions, data preprocessing and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using…
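As background for the bin-based calibration measurements mentioned in the abstract, here is a minimal sketch of one widely used example, the expected calibration error (ECE). It is not the paper's implementation: the function name, the equal-width binning scheme, and the toy inputs are assumptions chosen for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 15,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    Assumption: bins are [k/n_bins, (k+1)/n_bins), with confidence 1.0
    placed in the last bin. Other binning conventions exist.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0 or n_bins < 1:
        raise ValueError("need at least one sample and one bin")

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of samples per bin

    for c, is_correct in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if is_correct else 0.0
        count[idx] += 1

    ece = 0.0
    for s_conf, s_hit, cnt in zip(conf_sum, hit_sum, count):
        if cnt == 0:
            continue  # empty bins contribute nothing
        ece += (cnt / n) * abs(s_hit / cnt - s_conf / cnt)
    return ece


# Toy predictions (hypothetical): top-class confidence and correctness.
toy_conf = [0.95, 0.95, 0.65, 0.65]
toy_correct = [True, True, True, False]
ece_10_bins = expected_calibration_error(toy_conf, toy_correct, n_bins=10)
ece_1_bin = expected_calibration_error(toy_conf, toy_correct, n_bins=1)
```

On these four toy predictions the value with 10 bins (about 0.10) differs from the value with a single bin (about 0.05), which illustrates why the number of bins matters when comparing bin-based measurements.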
Peer Reviews
Decision: ICLR 2024 poster
1. This paper raises 7 initial questions for exploring the calibration properties in deep neural networks. 2. This paper builds a benchmark based on models generated by NAS for evaluating the calibration metrics. 3. This paper conducts extensive experiments to analyze and answer the questions.
1. The proposed calibration benchmark might be limited since it contains only convolutional neural networks for image classification. I am concerned about the calibration properties for other tasks, e.g., object detection or NLP tasks. It would be more convincing to extend the benchmark to more architectures and more tasks. 2. Most architectures and networks are generated from a similar search space, which might produce similar effects and limit the conclusions. Varying the search
S1. I really value the topic of calibration as I believe it is a good feature to have in many classification tasks and systems. I think the paper tackles an important problem. S2. The clarity of the paper is good, the narrative flows well, and is easy to understand and follow.
W1. The motivation for why NAS + Calibration is important is missing in the paper. Unfortunately, the paper lacks a clear justification for studying NAS + Calibration. It is not clear intuitively why this is a good direction to explore. It is not clear why a holistic approach is not worth exploring over NAS + Calibration. Unfortunately, the paper makes the reader believe that the analysis was done just because it has not been done before. I think the paper really needs to justify why NAS + Calibra
- The study of calibration properties of deep neural networks is an important research direction as it could allow developing well-calibrated architectures. - The paper develops a comprehensive benchmark of neural network architectures that are then evaluated on different datasets to answer various questions. Further, recent vision transformer architectures have also been included as part of evaluation. - Some questions included in the study are interesting and important: such as the Impact of
- Overall, the new questions posed and studied by the paper boil down to 1), 3) and 6), which are: model calibration across different datasets; reliability of calibration metrics; impact of bin size on calibration metrics. - Other questions are mostly expansions of existing studies. This seems to undermine the overall contributions of the paper to some extent. - Only post-hoc temperature scaling is used as a calibration technique to evaluate pre- and post-calibration performance of a la
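For context on the post-hoc temperature scaling referred to in the last review, the sketch below shows the general idea: divide the logits by a single scalar T chosen to minimize negative log-likelihood on held-out data. This is an illustrative assumption rather than the paper's code. The function names and toy data are hypothetical, and a simple grid search stands in for the gradient-based optimizer that is more commonly used.

```python
import math
from typing import List, Optional, Sequence


def log_softmax_at(logits: Sequence[float], label: int, temperature: float) -> float:
    """Log-probability of `label` under softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return scaled[label] - log_norm


def mean_nll(
    logits_batch: Sequence[Sequence[float]],
    labels: Sequence[int],
    temperature: float,
) -> float:
    """Average negative log-likelihood at a given temperature."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        total -= log_softmax_at(logits, y, temperature)
    return total / len(labels)


def fit_temperature(
    logits_batch: Sequence[Sequence[float]],
    labels: Sequence[int],
    grid: Optional[List[float]] = None,
) -> float:
    """Pick the temperature on a grid that minimizes held-out NLL.

    Assumption: a grid of 0.05 .. 10.0 is fine for illustration.
    """
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]
    return min(grid, key=lambda t: mean_nll(logits_batch, labels, t))


# Hypothetical overconfident model: ~98% confidence but only 75% accuracy.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1]
fitted_t = fit_temperature(val_logits, val_labels)
```

Because dividing by a positive T does not change the argmax, accuracy is unaffected and only the confidence values move. On this overconfident toy example the fitted temperature is above 1, which softens the predicted probabilities.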
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Machine Learning and Data Classification
