Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications
Christina Sauer, Anne-Laure Boulesteix, Luzia Hanßum, Farina Hodiamont, Claudia Bausewein, and Theresa Ullmann

TL;DR
This paper highlights the often-overlooked impact of preprocessing hyperparameters in machine learning, demonstrating how their improper tuning can lead to overestimated model performance and proposing better practices for model evaluation.
Contribution
It provides a comprehensive review and empirical analysis of preprocessing hyperparameters, emphasizing their role and potential pitfalls in model tuning and evaluation.
Findings
Preprocessing hyperparameters significantly affect model performance.
Unsystematic tuning of preprocessing steps can lead to exaggerated performance claims.
Awareness and proper reporting of preprocessing hyperparameters improve model reliability.
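The second finding can be illustrated with a small simulation. This is a minimal sketch, not the paper's experiment: it assumes that each preprocessing variant yields a model with a true accuracy of exactly 0.5 and that the variants' test errors are independent, which is a simplification. The function name `simulate_reported_accuracy` and all numeric settings are hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_reported_accuracy(n_test, n_variants, rng):
    """Simulate test accuracies of several preprocessing variants.

    Assumption: every variant has a true accuracy of 0.5, so any
    difference between variants on the test set is pure noise.
    Returns (accuracy of one pre-specified variant, best accuracy).
    """
    accuracies = []
    for _ in range(n_variants):
        # Each test observation is classified correctly with probability 0.5.
        n_correct = sum(rng.random() < 0.5 for _ in range(n_test))
        accuracies.append(n_correct / n_test)
    return accuracies[0], max(accuracies)


rng = random.Random(0)
results = [simulate_reported_accuracy(100, 10, rng) for _ in range(500)]

# Average reported accuracy when the variant is fixed in advance.
mean_fixed = statistics.mean(fixed for fixed, _ in results)
# Average reported accuracy when the best-looking variant is picked
# after inspecting test performance of all variants.
mean_best = statistics.mean(best for _, best in results)

print(f"pre-specified variant: {mean_fixed:.3f}")
print(f"best of 10 variants:   {mean_best:.3f}")
```

Under these assumptions, the pre-specified variant averages close to the true accuracy of 0.5, while reporting the best of several variants on the same test data gives a systematically higher number, even though no variant is actually better.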
Abstract
Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options…
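One way to handle a preprocessing hyperparameter such as the missing-value strategy is to select it on a validation split and touch the test set only once. The following is a minimal sketch under stated assumptions, not the procedure from the paper: the data are synthetic, the classifier is a simple one-feature nearest-centroid rule, and all names (`fill_value`, `fit_and_score`, `make_data`) are hypothetical.

```python
import random
import statistics


def fill_value(values, strategy):
    """Compute the imputation value from observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        return statistics.mean(observed)
    return statistics.median(observed)


def impute(values, fill):
    """Replace missing entries (None) with a fixed fill value."""
    return [fill if v is None else v for v in values]


def fit_centroids(xs, ys):
    """Nearest-centroid classifier on one feature: store the class means."""
    return {
        label: statistics.mean(x for x, y in zip(xs, ys) if y == label)
        for label in (0, 1)
    }


def accuracy(centroids, xs, ys):
    """Share of observations assigned to the correct (closest) class mean."""
    preds = [min(centroids, key=lambda c: abs(x - centroids[c])) for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def fit_and_score(train, evaluation, strategy):
    """Fit imputation and classifier on `train` only, then score `evaluation`."""
    (x_tr, y_tr), (x_ev, y_ev) = train, evaluation
    fill = fill_value(x_tr, strategy)  # learned from the training part only
    model = fit_centroids(impute(x_tr, fill), y_tr)
    return accuracy(model, impute(x_ev, fill), y_ev)


def make_data(n, rng):
    """Synthetic data: one feature shifted by class, about 20% missing."""
    ys = [rng.randint(0, 1) for _ in range(n)]
    xs = [rng.gauss(2.0 * y, 1.0) for y in ys]
    xs = [None if rng.random() < 0.2 else x for x in xs]
    return xs, ys


rng = random.Random(42)
xs, ys = make_data(300, rng)
train = (xs[:150], ys[:150])
valid = (xs[150:225], ys[150:225])
test = (xs[225:], ys[225:])

# Select the preprocessing hyperparameter using the validation split only.
strategies = ["mean", "median"]
valid_scores = {s: fit_and_score(train, valid, s) for s in strategies}
best_strategy = max(valid_scores, key=valid_scores.get)

# Refit on train + validation with the chosen strategy; evaluate on test once.
train_valid = (train[0] + valid[0], train[1] + valid[1])
test_accuracy = fit_and_score(train_valid, test, best_strategy)

print(f"selected strategy: {best_strategy}")
print(f"test accuracy:     {test_accuracy:.3f}")
```

The design choice in this sketch is that the test observations influence neither the choice of imputation strategy nor the imputation value itself, so the final accuracy is not inflated by the selection step. In practice, the same idea is often implemented with nested resampling instead of a single split.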
Taxonomy
Topics: Machine Learning and Data Classification
Methods: Softmax · Attention Is All You Need · Focus
