Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap
Aditya Tanna, Yash Desai, Pratinav Seth, Mohamed Bouadi, Nassim Bouarour, Vinay Kumar Sankarapu

TL;DR
This paper evaluates ensemble strategies for tabular foundation models, revealing a diversity ceiling and a calibration trade-off, with cascade stacking slightly outperforming single models at high computational cost.
Contribution
It benchmarks six ensemble strategies on 153 datasets, identifying the near-redundancy of models and analyzing the effects of stacking and calibration.
Findings
Cascade stacking improves accuracy by 0.18% over the best single model.
Ensemble diversity is limited, with models forming a near-redundant pool.
Meta-learner stacking sharpens class boundaries but worsens calibration.
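To make the calibration finding concrete, here is a minimal sketch of one common calibration measure, the binned expected calibration error (ECE). The summary does not say which calibration metric the paper uses, so ECE, the function name `expected_calibration_error`, and the equal-width binning are assumptions chosen for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean gap between accuracy and confidence.

    confidences: top-class probability per prediction, each in [0, 1].
    correct: truthy where the prediction matched the label.
    NOTE: illustrative assumption, not necessarily the paper's metric.
    """
    n = len(confidences)
    # Per bin: [number of samples, sum of confidences, number correct]
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(bool(ok))

    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            accuracy = hits / count
            mean_conf = conf_sum / count
            ece += (count / n) * abs(accuracy - mean_conf)
    return ece


# A model that is 75% accurate but reports 95% confidence is overconfident.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
```

Under this measure, a meta-learner can keep accuracy unchanged and still score worse if it pushes its probabilities toward 0 or 1, since only the confidence term moves.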
Abstract
Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix here, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is […], close enough to […] that any convex combination is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys 0.18% accuracy over the strongest single TFM at […] the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly worse than the best base. Stacking with a logistic-regression meta-learner is the most striking case: competitive accuracy and ROC-AUC, the…
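The pairwise Q-statistic mentioned in the abstract can be sketched as follows. This assumes the standard Yule Q-statistic over per-sample correctness, Q = (N11·N00 − N01·N10) / (N11·N00 + N01·N10), where N11 counts samples both models classify correctly, N00 samples both get wrong, and N10, N01 samples only one gets right. The function names are hypothetical, and the paper's exact averaging procedure is not given in this summary.

```python
from itertools import combinations
from math import isnan


def q_statistic(correct_a, correct_b):
    """Yule's Q between two models' per-sample correctness vectors."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1      # both correct
        elif not a and not b:
            n00 += 1      # both wrong
        elif a:
            n10 += 1      # only the first model correct
        else:
            n01 += 1      # only the second model correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return float("nan")  # undefined, e.g. both models always correct
    return (n11 * n00 - n01 * n10) / denom


def mean_pairwise_q(correct_by_model):
    """Average Q over all model pairs, skipping undefined pairs."""
    qs = [q_statistic(a, b) for a, b in combinations(correct_by_model, 2)]
    qs = [q for q in qs if not isnan(q)]
    return sum(qs) / len(qs) if qs else float("nan")


# Two models with identical errors give Q = 1; independent errors give Q = 0.
identical = q_statistic([1, 1, 0, 0], [1, 1, 0, 0])
independent = q_statistic([1, 1, 0, 0], [1, 0, 1, 0])
```

Q ranges from -1 to 1. Values near 1 mean the models err on the same samples, which leaves little for an averaging ensemble to correct.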
