TL;DR
The paper introduces the Manokhin Probability Matrix, a diagnostic tool that separates calibration from discrimination when assessing classifier probabilities, backed by an empirical study of 21 classifiers, 5 post-hoc calibrators, and 30 binary classification tasks.
Contribution
It proposes a two-dimensional diagnostic framework that places classifiers on a 2x2 grid by Spiegelhalter Z-statistic (calibration) and AUC-ROC expected rank (discrimination), and it reports empirical and theoretical results on how calibration and discrimination interact.
Findings
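Classifiers fall into four archetypes: Eagle (good on both axes), Bull (strong discrimination, poor calibration), Sloth (well-calibrated, weak discriminator), and Mole (poor on both).
Post-hoc calibration substantially improves log-loss for some classifiers but can degrade it for others.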
No order-preserving calibrator can enhance discriminatory power, emphasizing the importance of fixing calibration before discrimination.
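The order-preservation finding can be illustrated with a small sketch. AUC depends only on how scores rank positives against negatives, so a strictly increasing remapping leaves it unchanged even when log-loss moves. The toy labels and scores below are invented for illustration, and temperature scaling is used only as a convenient strictly increasing map; it is an assumption that it resembles any of the paper's five calibrators.

```python
import math

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, probs):
    """Mean negative log-likelihood of binary labels under the predicted probabilities."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, probs))
    return -total / len(y_true)

def temperature_scale(p, t):
    """Strictly increasing map: divide the logit by t (t > 1 softens confidence)."""
    z = math.log(p / (1 - p)) / t
    return 1 / (1 + math.exp(-z))

# Hypothetical toy data: an overconfident scorer with one badly wrong prediction.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.02, 0.10, 0.30, 0.45, 0.60, 0.90, 0.97, 0.99]
p_cal = [temperature_scale(v, 3.0) for v in p]

print("AUC before/after:     ", auc(y, p), auc(y, p_cal))
print("log-loss before/after:", log_loss(y, p), log_loss(y, p_cal))
```

On this toy data the ranking, and therefore the AUC, is identical before and after the remapping, while log-loss drops. Maps that are monotone but not strictly increasing can merge distinct scores into ties, which is a separate case this sketch does not cover.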
Abstract
The Brier score conflates two distinct properties of probabilistic predictions: reliability (calibration error) and resolution (discriminatory power). We introduce the Manokhin Probability Matrix, a BCG-style two-dimensional diagnostic framework that separates them. Classifiers are placed on a 2x2 grid by Spiegelhalter Z-statistic and AUC-ROC expected rank, then assigned to one of four archetypes: Eagle (good on both axes), Bull (strong discrimination, poor calibration), Sloth (well-calibrated, weak discriminator), and Mole (poor on both). Each archetype carries a distinct prescription. We populate the matrix from a large-scale empirical study spanning 21 classifiers, 5 post-hoc calibrators, and 30 real-world binary classification tasks from the TabArena-v0.1 suite. The assignment is unambiguous. CatBoost, TabICL, EBM, TabPFN, GBC, and Random Forest are Eagles. XGBoost, LightGBM, and…
