ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules
Jonas Landsgesell, Pascal Knoll, Tizian Wenzel

TL;DR
ScoringBench is an open benchmark that evaluates tabular regression models using proper scoring rules to better assess their probabilistic predictions across diverse datasets.
Contribution
The paper introduces a comprehensive, community-driven benchmark that evaluates probabilistic tabular models under multiple scoring rules and ranking protocols, and it highlights how much the choice of metric matters.
Findings
Model rankings vary significantly depending on the scoring rule used.
Models excelling in point-estimate metrics may perform poorly on probabilistic scores.
Evaluation metrics influence model selection and deployment decisions.
Abstract
Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, R²). This discards precisely the distributional information these models are designed to provide, a critical gap for high-stakes domains where not all kinds of errors are equally costly. We introduce ScoringBench, an open and extensible benchmark that evaluates tabular regression models under a comprehensive suite of proper scoring rules (including CRPS, CRLS, interval score, energy score, and weighted CRPS) alongside standard point metrics. ScoringBench covers 97 regression datasets from diverse domains, supports transparent community contributions via a git-based leaderboard, and provides two complementary ranking protocols: an ordinal Demšar/autorank approach and a…
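To make the contrast between point metrics and proper scoring rules concrete, the following is a minimal sketch of two of the scores named in the abstract, CRPS and the interval score, using their standard textbook definitions. It is not ScoringBench's own implementation: the function names `crps_from_samples` and `interval_score`, the sample-based CRPS estimator, and the synthetic Gaussian forecasts are all assumptions made for illustration.

```python
import numpy as np


def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better).

    Hypothetical helper for illustration; uses all sample pairs, so memory
    grows quadratically with the number of samples.
    """
    samples = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(samples - y))
    term_spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term_obs - term_spread)


def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) prediction interval.

    Interval width plus a penalty of 2/alpha times the distance by which
    the observation falls outside the interval (lower is better).
    """
    width = upper - lower
    below = (2.0 / alpha) * max(lower - y, 0.0)
    above = (2.0 / alpha) * max(y - upper, 0.0)
    return float(width + below + above)


# Two synthetic forecasts with the same centre (0.0) but different spread.
# Their point predictions are nearly identical, so a point metric such as
# absolute error barely separates them; CRPS does.
rng = np.random.default_rng(0)
y_true = 2.0
sharp = rng.normal(loc=0.0, scale=0.5, size=1000)  # overconfident forecast
wide = rng.normal(loc=0.0, scale=3.0, size=1000)   # honest about uncertainty

crps_sharp = crps_from_samples(sharp, y_true)
crps_wide = crps_from_samples(wide, y_true)
print(f"CRPS sharp: {crps_sharp:.3f}, CRPS wide: {crps_wide:.3f}")
```

In this synthetic setup the observation lies far outside the narrow forecast, so the wider forecast receives the lower (better) CRPS even though both forecasts share essentially the same point prediction. For a forecast that is a single point, the CRPS estimate reduces to the absolute error, which is why it can sit alongside point metrics in one evaluation.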
