Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications

Christina Sauer, Anne-Laure Boulesteix, Luzia Hanßum, Farina Hodiamont, Claudia Bausewein, and Theresa Ullmann

arXiv:2412.03491 · stat.ML · August 18, 2025 · 3 cites

TL;DR

This paper highlights the often-overlooked impact of preprocessing hyperparameters in machine learning, demonstrating how their improper tuning can lead to overestimated model performance and proposing better practices for model evaluation.

Contribution

It provides a comprehensive review and empirical analysis of preprocessing hyperparameters, emphasizing their role and potential pitfalls in model tuning and evaluation.

Findings

1. Preprocessing hyperparameters significantly affect model performance.
2. Unsystematic tuning of preprocessing steps can lead to exaggerated performance claims.
3. Awareness and proper reporting of preprocessing hyperparameters improve model reliability.

Abstract

Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options…
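The pitfall the abstract points to can be illustrated with a small simulation. The sketch below is a toy example and not the authors' experimental setup: the data are pure noise, the only preprocessing hyperparameter is a hypothetical choice of imputation rule for missing feature values, and the learner is a simple one-feature threshold rule. All function names and settings (`OPTIONS`, `simulate`, the sample sizes) are assumptions made for illustration. It compares picking the imputation rule by looking at test performance with picking it on a split of the training data only.

```python
import random
import statistics

# Hypothetical preprocessing hyperparameter: how to impute missing feature values.
OPTIONS = ("mean", "median", "min", "max")


def make_data(n, rng):
    """Pure-noise data: the feature carries no information about the label."""
    rows = []
    for _ in range(n):
        x = None if rng.random() < 0.3 else rng.gauss(0.0, 1.0)  # about 30% missing
        rows.append((x, rng.randint(0, 1)))
    return rows


def fill_value(train, option):
    """Compute the imputation value from the training data only."""
    observed = [x for x, _ in train if x is not None]
    if option == "mean":
        return statistics.mean(observed)
    if option == "median":
        return statistics.median(observed)
    return min(observed) if option == "min" else max(observed)


def impute(rows, fill):
    return [(fill if x is None else x, y) for x, y in rows]


def accuracy(rows, model):
    threshold, sign = model
    hits = sum(int(sign * (x - threshold) > 0) == y for x, y in rows)
    return hits / len(rows)


def fit_stump(rows):
    """Pick the threshold and direction with the best training accuracy."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        for sign in (1, -1):
            acc = accuracy(rows, (threshold, sign))
            if best is None or acc > best[0]:
                best = (acc, threshold, sign)
    return best[1], best[2]


def fit_and_score(train, test, option):
    fill = fill_value(train, option)
    model = fit_stump(impute(train, fill))
    return accuracy(impute(test, fill), model)


def simulate(reps=100, seed=0):
    rng = random.Random(seed)
    optimistic, honest = [], []
    for _ in range(reps):
        train, test = make_data(40, rng), make_data(40, rng)
        test_scores = {o: fit_and_score(train, test, o) for o in OPTIONS}
        # Pitfall: try every option and report the best test accuracy.
        optimistic.append(max(test_scores.values()))
        # Alternative: choose the option on an inner split of the training data,
        # then evaluate that single choice once on the test data.
        inner_train, inner_val = train[:30], train[30:]
        chosen = max(OPTIONS, key=lambda o: fit_and_score(inner_train, inner_val, o))
        honest.append(test_scores[chosen])
    return statistics.mean(optimistic), statistics.mean(honest)


if __name__ == "__main__":
    optimistic_mean, honest_mean = simulate()
    print(f"best-of-options test accuracy: {optimistic_mean:.3f}")
    print(f"option chosen on training data: {honest_mean:.3f}")
```

Because the labels are random, no preprocessing choice can truly beat chance here. The first number is a maximum over several test scores, so it can never fall below the second, and any gap between the two reflects selection on the test data rather than real predictive signal.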

Code & Models

Repositories

NiesslC/overoptimistic_trees (official)

Taxonomy

Topics: Machine Learning and Data Classification

Methods: Softmax · Attention Is All You Need · Focus