TL;DR
This study demonstrates that regularized target encoding significantly improves predictive performance in machine learning tasks involving high-cardinality categorical features, outperforming traditional encoding methods across various algorithms and datasets.
Contribution
The paper provides a comprehensive benchmark showing that regularized target encoding outperforms traditional encoding techniques in high-cardinality scenarios, offering practical guidelines for encoding choices.
Findings
Regularized target encoding yields the best predictive performance.
Traditional encodings like integer or leaf encoding are less effective.
The results are consistent across multiple algorithms and dataset types.
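To make the central finding concrete, here is a minimal sketch of one common form of regularized target encoding. The text above does not say which regularization the paper uses, so this sketch assumes two widely used ingredients: shrinking each level's target mean toward the global mean, and computing training-set encodings out of fold so that no row sees its own target. The function names and the `smoothing` and `n_folds` parameters are hypothetical, chosen for illustration only.

```python
from collections import defaultdict


def smoothed_target_encode(train_cats, train_y, apply_cats, smoothing=10.0):
    """Encode apply_cats with statistics learned from (train_cats, train_y).

    Each level maps to a weighted average of its own target mean and the
    global target mean. `smoothing` (an assumed, illustrative parameter) acts
    as a pseudo-count for the global mean, so rare levels are shrunk more
    strongly. Levels unseen in training receive the global mean.
    """
    global_mean = sum(train_y) / len(train_y)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, target in zip(train_cats, train_y):
        sums[level] += target
        counts[level] += 1
    encoded = []
    for level in apply_cats:
        n = counts.get(level, 0)
        s = sums.get(level, 0.0)
        encoded.append((s + smoothing * global_mean) / (n + smoothing))
    return encoded


def out_of_fold_target_encode(cats, y, n_folds=5, smoothing=10.0):
    """Encode training rows so that no row is encoded with its own target."""
    n = len(cats)
    if n < 2 or n_folds < 2:
        raise ValueError("need at least 2 rows and 2 folds")
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Simple interleaved fold assignment; shuffle beforehand if rows are ordered.
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        if not held_out:
            continue
        fold_values = smoothed_target_encode(
            [cats[i] for i in rest],
            [y[i] for i in rest],
            [cats[i] for i in held_out],
            smoothing=smoothing,
        )
        for i, value in zip(held_out, fold_values):
            encoded[i] = value
    return encoded
```

In this sketch, training rows would be encoded with `out_of_fold_target_encode`, while new data would be encoded with `smoothed_target_encode` fitted on the full training set. Without either form of regularization, a level that occurs only once would be encoded as its own target value, which leaks the label into the feature.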
Abstract
Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high-cardinality features, i.e., unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables, which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance and, if possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary classification, and multiclass classification settings. In our…
