Classifier Calibration at Scale: An Empirical Study of Model-Agnostic Post-Hoc Methods

Valery Manokhin; Daniel Grønhaug

arXiv:2601.19944·cs.LG·January 29, 2026

Open Access

TL;DR

This study empirically evaluates five post-hoc calibration methods for binary classifiers on tabular data, highlighting the strengths of Venn-Abers and Beta calibration in improving probabilistic predictions and the limitations of traditional methods such as Platt scaling.

Contribution

The paper benchmarks 21 classifiers and 5 calibration methods, showing the relative effectiveness and limitations of each on real-world tabular classification tasks.

Findings

01

Venn-Abers predictors achieve the largest average log-loss reduction.

02

Beta calibration most frequently improves log-loss across tasks.

03

Platt scaling and isotonic regression can degrade calibration performance.

Abstract

We study model-agnostic post-hoc calibration methods intended to improve probabilistic predictions in supervised binary classification on real i.i.d. tabular data, with particular emphasis on conformal and Venn-based approaches that provide distribution-free validity guarantees under exchangeability. We benchmark 21 widely used classifiers, including linear models, SVMs, tree ensembles (CatBoost, XGBoost, LightGBM), and modern tabular neural and foundation models, on binary tasks from the TabArena-v0.1 suite using randomized, stratified five-fold cross-validation with a held-out test fold. Five calibrators (isotonic regression, Platt scaling, Beta calibration, Venn-Abers predictors, and Pearsonify) are trained on a separate calibration split and applied to test predictions. Calibration is evaluated using proper scoring rules (log-loss and Brier score) and diagnostic measures…
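
The protocol in the abstract (fit a calibrator on a separate calibration split, apply it to test predictions, score with log-loss and Brier score) can be illustrated with a minimal sketch. This is not the authors' code. It assumes synthetic data in place of TabArena, a deliberately overconfident stand-in classifier, a single split instead of five-fold cross-validation, and a hand-rolled Platt scaler fitted by gradient descent. All function and variable names (make_split, fit_platt, and so on) are hypothetical.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_loss(y, p, eps=1e-12):
    # Mean negative log-likelihood; probabilities are clipped away from 0 and 1.
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y)

def brier(y, p):
    # Mean squared error between predicted probability and the 0/1 label.
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def make_split(n):
    # Assumption: the true log-odds z are Gaussian, and the stand-in
    # classifier reports 3 * z, i.e. it is systematically overconfident.
    scores, labels = [], []
    for _ in range(n):
        z = random.gauss(0.0, 1.5)
        labels.append(1 if random.random() < sigmoid(z) else 0)
        scores.append(3.0 * z)
    return scores, labels

def fit_platt(scores, y, lr=0.1, n_iter=500):
    # Platt scaling: fit p = sigmoid(a * score + b) by gradient descent
    # on the mean log-loss over the calibration split.
    a, b = 1.0, 0.0
    n = len(y)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, yi in zip(scores, y):
            err = sigmoid(a * s + b) - yi
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Separate calibration and test splits, as in the described protocol.
s_cal, y_cal = make_split(1000)
s_test, y_test = make_split(1000)

a, b = fit_platt(s_cal, y_cal)
p_raw = [sigmoid(s) for s in s_test]          # uncalibrated probabilities
p_cal = [sigmoid(a * s + b) for s in s_test]  # Platt-calibrated probabilities

ll_raw, ll_cal = log_loss(y_test, p_raw), log_loss(y_test, p_cal)
br_raw, br_cal = brier(y_test, p_raw), brier(y_test, p_cal)
print(f"log-loss raw={ll_raw:.4f} calibrated={ll_cal:.4f}")
print(f"Brier    raw={br_raw:.4f} calibrated={br_cal:.4f}")
```

Because the synthetic classifier is miscalibrated by a pure rescaling of the log-odds, Platt scaling is well matched to it here. That is a property of this toy setup, not a claim about the benchmark, where the paper reports that Platt scaling and isotonic regression can degrade calibration performance.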

Taxonomy

Topics: Imbalanced Data Classification Techniques · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning