Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules
Jonas Landsgesell, Pascal Knoll

TL;DR
This paper advocates for evaluating tabular foundation models using proper scoring rules to better assess the quality of their probabilistic predictions, and demonstrates how training with these rules influences model performance.
Contribution
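It proposes supplementing standard point-estimate metrics with proper scoring rules (CRPS, CRLS, and the Interval Score) when benchmarking tabular foundation models, and shows how fine-tuning with these rules improves model calibration and ranking.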
Findings
Different proper scoring rules induce different model rankings and different inductive biases during training.
Fine-tuning with scoring rules improves predictive distribution quality.
Benchmarking should include distributional metrics for comprehensive evaluation.
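To make the first finding concrete, here is a toy sketch showing that two proper scoring rules can rank the same pair of forecasts differently on a single observation. It is not taken from the paper: the Gaussian forecasts, the observation, and the use of the logarithmic score (negative log-likelihood) as the second rule are all assumptions chosen for illustration. The CRPS uses the standard closed form for a Gaussian predictive distribution.

```python
import math


def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at observation y (lower is better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def gaussian_log_score(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2) (lower is better)."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + 0.5 * z * z


# Hypothetical setup: one observation and two candidate forecasts.
y_obs = 0.0
forecast_a = (0.0, 1.0)  # centered on the observation, but wide
forecast_b = (0.5, 0.5)  # slightly off-center, but sharp

for name, (mu, sigma) in [("A", forecast_a), ("B", forecast_b)]:
    crps = gaussian_crps(y_obs, mu, sigma)
    nll = gaussian_log_score(y_obs, mu, sigma)
    print(f"forecast {name}: CRPS={crps:.4f}  log score={nll:.4f}")
```

In this toy case forecast A has the lower CRPS while forecast B has the lower log score, so the two rules disagree about which forecast is better. On a benchmark the scores would be averaged over many test points, but the same kind of disagreement can carry over to model rankings.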
Abstract
Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet the benchmarks used to evaluate them (TabArena, TALENT, and others) still rely almost exclusively on point-estimate metrics (RMSE, R²). This mismatch implicitly rewards models that elicit a good conditional mean while ignoring the quality of the predicted distribution. We make two contributions. First, we propose supplementing standard point metrics with proper scoring rules (CRPS, CRLS, and the Interval Score) and provide a head-to-head comparison of realTabPFNv2.5 and TabICLv2 under a selection of proper scoring rules across 20 OpenML regression datasets. Second, we show analytically and empirically that different proper scoring rules induce different model rankings and different inductive biases during training, even though each rule is individually minimized by the true…
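As a rough illustration of two of the scoring rules named in the abstract, the sketch below computes a sample-based CRPS estimate and the Interval Score for a single observation. The function names and the sample-based estimator are assumptions made for this example; the paper may compute these quantities differently (for instance, directly from a model's binned or quantile output). CRLS is omitted because its definition is not given in the text above.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better).

    `samples` are draws from the predictive distribution; `y` is the observed value.
    """
    n = len(samples)
    term_obs = sum(abs(x - y) for x in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


def interval_score(y, lower, upper, alpha):
    """Interval Score for a central (1 - alpha) prediction interval (lower is better).

    The score is the interval width plus a penalty of 2/alpha per unit
    by which the observation falls outside the interval.
    """
    score = upper - lower
    if y < lower:
        score += (2.0 / alpha) * (lower - y)
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)
    return score


# Hypothetical usage with made-up numbers.
print(crps_from_samples([0.0, 2.0], 1.0))
print(interval_score(2.0, lower=0.0, upper=4.0, alpha=0.1))  # observation inside the interval
print(interval_score(5.0, lower=0.0, upper=4.0, alpha=0.1))  # observation above the interval
```

Both scores are negatively oriented, so they can be averaged over a test set and reported next to RMSE. The double loop in `crps_from_samples` is quadratic in the number of samples, which is fine for a sketch but would usually be replaced by a sorted-sample formula at scale.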
