Classifier Calibration at Scale: An Empirical Study of Model-Agnostic Post-Hoc Methods
Valery Manokhin, Daniel Grønhaug

TL;DR
This study empirically evaluates five model-agnostic post-hoc calibration methods for binary classifiers on tabular data. Venn-Abers and Beta calibration stand out for improving probabilistic predictions, while traditional methods such as Platt scaling show limitations.
Contribution
The paper benchmarks 21 classifiers and five calibration methods on real-world tabular binary classification tasks, showing the relative effectiveness and limitations of each method.
Findings
Venn-Abers predictors achieve the largest average log-loss reduction.
Beta calibration most frequently improves log-loss across tasks.
Platt scaling and isotonic regression can degrade calibration performance.
Abstract
We study model-agnostic post-hoc calibration methods intended to improve probabilistic predictions in supervised binary classification on real i.i.d. tabular data, with particular emphasis on conformal and Venn-based approaches that provide distribution-free validity guarantees under exchangeability. We benchmark 21 widely used classifiers, including linear models, SVMs, tree ensembles (CatBoost, XGBoost, LightGBM), and modern tabular neural and foundation models, on binary tasks from the TabArena-v0.1 suite using randomized, stratified five-fold cross-validation with a held-out test fold. Five calibrators (isotonic regression, Platt scaling, Beta calibration, Venn-Abers predictors, and Pearsonify) are trained on a separate calibration split and applied to test predictions. Calibration is evaluated using proper scoring rules (log-loss and Brier score) and diagnostic measures…
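The protocol in the abstract (fit a base classifier, fit a calibrator on a separate calibration split, score the calibrated test predictions with log-loss and Brier score) can be illustrated with a minimal sketch. This is not the authors' code. It assumes scikit-learn, a synthetic dataset, a random forest standing in for the 21 benchmarked classifiers, and a single train/calibration/test split instead of five-fold cross-validation. The function names (`fit_platt`, `fit_isotonic`, `fit_beta`, `evaluate_calibrators`) are hypothetical. Platt scaling is applied here to the logit of the base probability, and the Beta calibrator omits the non-negativity constraint on its shape parameters; both are simplifications. Venn-Abers and Pearsonify are left out because they need additional packages.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

EPS = 1e-6  # keep probabilities away from 0 and 1 before taking logs


def _clip(p):
    return np.clip(np.asarray(p, dtype=float), EPS, 1.0 - EPS)


def fit_platt(p_cal, y_cal):
    """Platt-style scaling: logistic regression on the logit of the base score."""
    def logit(p):
        p = _clip(p)
        return np.log(p / (1.0 - p)).reshape(-1, 1)
    # Large C approximates an unregularized fit (assumption of this sketch).
    model = LogisticRegression(C=1e6).fit(logit(p_cal), y_cal)
    return lambda p: model.predict_proba(logit(p))[:, 1]


def fit_isotonic(p_cal, y_cal):
    """Isotonic regression: monotone, piecewise-constant map from score to probability."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(p_cal, y_cal)
    return lambda p: iso.predict(p)


def fit_beta(p_cal, y_cal):
    """Beta calibration: logistic regression on the features ln(p) and -ln(1 - p)."""
    def features(p):
        p = _clip(p)
        return np.column_stack([np.log(p), -np.log1p(-p)])
    model = LogisticRegression(C=1e6).fit(features(p_cal), y_cal)
    return lambda p: model.predict_proba(features(p))[:, 1]


def evaluate_calibrators(seed=0):
    """Return {name: (log_loss, brier_score)} on a held-out test split."""
    X, y = make_classification(n_samples=3000, n_features=20, random_state=seed)
    # 60% train, 20% calibration, 20% test, all stratified.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_cal, X_test, y_cal, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)

    base = RandomForestClassifier(n_estimators=100, random_state=seed)
    base.fit(X_train, y_train)
    p_cal = base.predict_proba(X_cal)[:, 1]
    p_test = base.predict_proba(X_test)[:, 1]

    calibrators = {
        "uncalibrated": lambda p: p,
        "platt": fit_platt(p_cal, y_cal),
        "isotonic": fit_isotonic(p_cal, y_cal),
        "beta": fit_beta(p_cal, y_cal),
    }
    results = {}
    for name, calibrate in calibrators.items():
        p = _clip(calibrate(p_test))  # clipping keeps log-loss finite
        results[name] = (log_loss(y_test, p), brier_score_loss(y_test, p))
    return results


if __name__ == "__main__":
    for name, (ll, brier) in evaluate_calibrators().items():
        print(f"{name:>12}: log-loss={ll:.4f}  Brier={brier:.4f}")
```

Fitting each calibrator only on the calibration split keeps the test split untouched, so the scores reflect out-of-sample calibration. Results on a synthetic dataset say nothing about the paper's findings; the sketch only shows the mechanics.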
Taxonomy
Topics: Imbalanced Data Classification Techniques · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
