Robustness of AutoML on Dirty Categorical Data
Marcos L. P. Bueno, Joaquin Vanschoren

TL;DR
This paper evaluates how AutoML methods perform on dirty categorical data and proposes a pipeline with advanced encoding schemes to improve robustness and predictive accuracy.
Contribution
The paper introduces a pipeline that transforms categorical data into numerical data for AutoML and benchmarks its robustness on dirty datasets, providing insights into model performance and pipeline structures.
Findings
AutoML performance improves with advanced encoding schemes on dirty data
The proposed pipeline yields better predictive accuracy than standard AutoML approaches
The benchmark provides insights into AutoML pipeline structures on challenging datasets
Abstract
The goal of automated machine learning (AutoML) is to reduce trial and error when doing machine learning (ML). Although AutoML methods for classification are able to deal with data imperfections, such as outliers, multiple scales and missing data, their behavior is less well understood on dirty categorical datasets. These datasets often have several categorical features with high cardinality arising from issues such as lack of curation and automated collection. Recent research has shown that ML models can benefit from morphological encoders for dirty categorical data, leading to significantly superior predictive performance. However, the effects of using such encoders in AutoML methods are not known at the moment. In this paper, we propose a pipeline that transforms categorical data into numerical data so that an AutoML method can handle categorical data transformed by more advanced encoding schemes. We…
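The truncated abstract does not say which encoders the pipeline uses, so the following is only a rough illustration of the general idea of morphological encoding, not the authors' method. It assumes a character n-gram similarity encoder: each dirty category string is turned into a numeric vector of its similarities to a small set of reference categories, which an AutoML tool could then consume as ordinary numerical features. The function names, the trigram size, and the choice of reference categories are all hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the multiset of character n-grams of a padded, lower-cased string."""
    padded = f" {text.strip().lower()} "
    return Counter(padded[i:i + n] for i in range(max(len(padded) - n + 1, 1)))


def ngram_similarity(a, b, n=3):
    """Jaccard-style similarity on n-gram multisets: shared n-grams / all n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    shared = sum((grams_a & grams_b).values())
    total = sum((grams_a | grams_b).values())
    return shared / total if total else 0.0


def similarity_encode(values, prototypes, n=3):
    """Encode each string as its similarity to every prototype category.

    `prototypes` is an assumed, hand-picked list of clean reference categories;
    the result is one numeric row per input value.
    """
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


if __name__ == "__main__":
    # Hypothetical dirty column: a typo and inconsistent casing.
    dirty = ["London", "Londn", "Paris"]
    for value, row in zip(dirty, similarity_encode(dirty, ["london", "paris"])):
        print(value, row)
```

Unlike one-hot encoding, a misspelled value such as "Londn" keeps a nonzero similarity to "london" instead of becoming an unrelated new category, which is the property that makes this family of encoders attractive for high-cardinality, uncurated columns.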
Taxonomy
Topics: Machine Learning and Data Classification · Text and Document Classification Technologies · Imbalanced Data Classification Techniques
