Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

Florian Pargent; Florian Pfisterer; Janek Thomas; Bernd Bischl

arXiv:2104.00629·stat.ML·March 7, 2022

TL;DR

This benchmark study finds that regularized target encoding gives better predictive performance than traditional encodings for high-cardinality categorical features. The comparison covers five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) on regression, binary classification, and multiclass classification datasets.

Contribution

The paper provides a comprehensive benchmark showing that regularized target encoding outperforms traditional encoding techniques in high-cardinality scenarios, offering practical guidelines for encoding choices.

Findings

01 Regularized target encoding yields the best predictive performance.

02 Traditional encodings like integer or leaf encoding are less effective.

03 The results are consistent across multiple algorithms and dataset types.
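To make the term "regularized target encoding" in the findings above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes a numeric (regression) target, regularizes by shrinking each level's mean toward the global mean, and encodes each training row from the other folds only. The function names and the `smoothing`, `n_folds`, and `seed` parameters are hypothetical choices made for this example.

```python
from collections import defaultdict
import random


def smoothed_means(levels, targets, smoothing):
    """Per-level target means, shrunk toward the global mean (assumed regularizer)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    # Levels with few observations are pulled more strongly toward the global mean.
    encoding = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return encoding, global_mean


def out_of_fold_target_encode(levels, targets, n_folds=5, smoothing=10.0, seed=0):
    """Encode each training row using target statistics from the other folds only."""
    n = len(levels)
    if n_folds < 2 or n < 2:
        raise ValueError("need at least 2 folds and 2 observations")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    encoded = [0.0] * n
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in indices if i not in held_out_set]
        encoding, global_mean = smoothed_means(
            [levels[i] for i in train_idx],
            [targets[i] for i in train_idx],
            smoothing,
        )
        for i in held_out:
            # Levels not seen in the other folds fall back to the global mean.
            encoded[i] = encoding.get(levels[i], global_mean)
    return encoded


# Toy usage with made-up data: one high-cardinality-style feature and a numeric target.
levels = ["a", "a", "b", "b", "c", "a"]
targets = [1.0, 3.0, 10.0, 12.0, 5.0, 2.0]
encoded = out_of_fold_target_encode(levels, targets, n_folds=3, smoothing=1.0)
```

Because a row's own target never contributes to its encoded value, the encoded feature leaks less target information into training than a plain per-level mean would. At prediction time, one would typically apply `smoothed_means` fitted on the full training set to new data.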

Abstract

Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance and, if possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary classification, and multiclass classification settings. In our…
