TL;DR
This study systematically compares five data scaling techniques across twenty classification algorithms and eighty-two datasets. It finds that the choice of scaling technique significantly affects model performance, and that a poorly chosen technique can sometimes hurt performance more than not scaling at all.
Contribution
It provides a comprehensive experimental analysis of how different scaling methods affect classification performance, highlighting the importance of selecting appropriate scaling techniques.
Findings
The choice of scaling technique significantly affects classification accuracy.
Inadequate scaling can harm performance more than no scaling.
Ensemble models' sensitivity to scaling mirrors that of their base models.
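The first two findings can be explored with a small experiment. The sketch below is not the paper's protocol: the four scalers, the k-nearest-neighbors classifier, and the Wine dataset are assumptions chosen for illustration using scikit-learn, and `compare_scalers` is a hypothetical helper name. It compares cross-validated accuracy with no scaling against several common scaling techniques.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)


def compare_scalers(X, y, cv=5):
    """Return mean cross-validated accuracy for each scaling option.

    The scalers listed here are illustrative assumptions; the summary
    above does not name the five techniques the paper evaluates.
    """
    scalers = {
        "none": None,  # baseline: raw, unscaled features
        "standard": StandardScaler(),
        "min-max": MinMaxScaler(),
        "max-abs": MaxAbsScaler(),
        "robust": RobustScaler(),
    }
    results = {}
    for name, scaler in scalers.items():
        clf = KNeighborsClassifier(n_neighbors=5)
        # Fitting the scaler inside the pipeline keeps test-fold
        # statistics out of the training step.
        model = clf if scaler is None else make_pipeline(scaler, clf)
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[name] = scores.mean()
    return results


X, y = load_wine(return_X_y=True)
results = compare_scalers(X, y)
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>8}: {acc:.3f}")
```

A distance-based classifier is used here because it is known to be sensitive to feature ranges. Swapping in a tree-based model would be one way to probe how sensitivity varies across algorithms.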
Abstract
Dataset scaling, also known as normalization, is an essential preprocessing step in a machine learning pipeline. It aims to adjust attribute scales so that they all vary within the same range. This transformation is known to improve the performance of classification models, but there are several scaling techniques to choose from, and this choice is not generally made carefully. In this paper, we run a broad experiment comparing the impact of 5 scaling techniques on the performance of 20 classification algorithms, including monolithic and ensemble models, applied to 82 publicly available datasets with varying imbalance ratios. Results show that the choice of scaling technique matters for classification performance, and the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases. They also indicate…
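To make "vary within the same range" concrete, here are two widely used scaling transformations written out for a single attribute $x$. They are given as common examples only. The excerpt above does not list the five techniques the paper compares, so their inclusion here is an assumption.

```latex
% Min-max scaling: maps the observed range of x onto [0, 1]
x' = \frac{x - \min(x)}{\max(x) - \min(x)}

% Standardization (z-score): zero mean and unit variance,
% where \mu and \sigma are the attribute's mean and standard deviation
z = \frac{x - \mu}{\sigma}
```

For instance, an attribute with observed values between 50 and 150 maps the value 75 to $x' = (75 - 50) / (150 - 50) = 0.25$ under min-max scaling.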
Taxonomy
Methods: Balanced Selection
