A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data
Andrej Tschalzev, Sascha Marton, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt

TL;DR
This paper argues for data-centric evaluation of machine learning models for tabular data: dataset-specific preprocessing, feature engineering, and test-time adaptation significantly influence both model performance and model rankings.
Contribution
It introduces a data-centric evaluation framework with expert-level preprocessing pipelines, highlighting the impact of data handling on model assessment and ranking stability.
Findings
Dataset-specific feature engineering alters model rankings and reduces the performance gaps between models.
Manual feature engineering benefits both tree-based models and neural networks.
Adapting to distribution shifts is important even in static tabular data.
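The first finding can be made concrete with a small ranking comparison. The following is a minimal sketch, not the authors' evaluation code: the helper names (`rank_models`, `performance_gap`, `rank_shifts`), the model labels, and all scores are hypothetical placeholders chosen for illustration, and none of the numbers are results from the paper.

```python
from typing import Dict, List


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Return model names ordered from best to worst (higher score is better)."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def performance_gap(scores: Dict[str, float]) -> float:
    """Difference between the best and the worst score within one setup."""
    return max(scores.values()) - min(scores.values())


def rank_shifts(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, int]:
    """Positions gained (positive) or lost (negative) by each model between two setups."""
    rank_before = {m: i for i, m in enumerate(rank_models(before))}
    rank_after = {m: i for i, m in enumerate(rank_models(after))}
    return {m: rank_before[m] - rank_after[m] for m in before}


# Hypothetical placeholder scores (NOT taken from the paper):
# the same three models evaluated under two preprocessing pipelines.
standardized = {"gbdt": 0.80, "mlp": 0.74, "linear": 0.70}
feature_engineered = {"gbdt": 0.83, "mlp": 0.84, "linear": 0.81}

if __name__ == "__main__":
    print("ranking (standardized):      ", rank_models(standardized))
    print("ranking (feature-engineered):", rank_models(feature_engineered))
    print("rank shifts:", rank_shifts(standardized, feature_engineered))
    print("gap before:", round(performance_gap(standardized), 3))
    print("gap after: ", round(performance_gap(feature_engineered), 3))
```

With these placeholder values the top two models swap places and the best-to-worst gap narrows, which is the kind of pattern the finding describes. In an actual study, the scores would come from evaluating each model on each dataset under each preprocessing pipeline.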
Abstract
Tabular data is prevalent in real-world machine learning applications, and new models for supervised learning of tabular data are frequently proposed. Comparative studies assessing the performance of models typically consist of model-centric evaluation setups with overly standardized data preprocessing. This paper demonstrates that such model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering. Therefore, we propose a data-centric evaluation framework. We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset. We conduct experiments with different preprocessing pipelines and hyperparameter optimization (HPO) regimes to quantify the impact of model selection, HPO, feature engineering, and test-time adaptation. Our main findings are: 1. After…
Taxonomy
Topics: Advanced Data Processing Techniques · Machine Learning and Data Classification · Computational Physics and Python Applications
Methods: Hyper-parameter optimization
