# Towards Personalized Preprocessing Pipeline Search

**Authors:** Diego Martinez, Daochen Zha, Qiaoyu Tan, Xia Hu

arXiv: 2302.14329 · 2023-03-01

## TL;DR

This paper introduces ClusterP3S, an AutoML framework that searches for a separate preprocessing pipeline for each feature. It keeps the search tractable by learning feature clusters and sharing one pipeline within each cluster, and experiments on benchmark classification datasets show the approach is effective.

## Contribution

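It proposes a hierarchical search strategy that enables feature-wise preprocessing pipeline search in AutoML. The upper level trains a deep clustering network with reinforcement learning to group features, and the lower level uses random search to find the best pipelines for a given cluster assignment.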

## Key findings

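- Experiments on benchmark classification datasets demonstrate the effectiveness of feature-wise preprocessing pipeline search.
- Jointly learning the feature clusters (upper level) and searching for pipelines (lower level) lets the clustering adapt so that better pipelines can be built on top of it.
- Sharing one preprocessing pipeline among the features in a cluster significantly reduces a search space that otherwise grows exponentially with the number of features.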
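
The size of the reduction can be illustrated with a rough count. This is a sketch under stated assumptions, not a formula from the paper: suppose every feature chooses from the same set of $P$ candidate pipelines, the dataset has $n$ features, and the features are grouped into $k$ clusters (the symbols $P$, $n$, and $k$ are introduced here for illustration). For a fixed cluster assignment:

$$
|\mathcal{S}_{\text{feature-wise}}| = P^{\,n},
\qquad
|\mathcal{S}_{\text{cluster-wise}}| = P^{\,k},
\qquad k \le n .
$$

With the illustrative values $P = 10$, $n = 20$, and $k = 3$, the count drops from $10^{20}$ to $10^{3}$ candidate configurations. The cluster assignment itself still has to be learned, which is the job of the upper-level search.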

## Abstract

Feature preprocessing, which transforms raw input features into numerical representations, is a crucial step in automated machine learning (AutoML) systems. However, the existing systems often have a very small search space for feature preprocessing with the same preprocessing pipeline applied to all the numerical features. This may result in sub-optimal performance since different datasets often have various feature characteristics, and features within a dataset may also have their own preprocessing preferences. To bridge this gap, we explore personalized preprocessing pipeline search, where the search algorithm is allowed to adopt a different preprocessing pipeline for each feature. This is a challenging task because the search space grows exponentially with more features. To tackle this challenge, we propose ClusterP3S, a novel framework for Personalized Preprocessing Pipeline Search via Clustering. The key idea is to learn feature clusters such that the search space can be significantly reduced by using the same preprocessing pipeline for the features within a cluster. To this end, we propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines, where the upper-level search optimizes the feature clustering to enable better pipelines built upon the clusters, and the lower-level search optimizes the pipeline given a specific cluster assignment. We instantiate this idea with a deep clustering network that is trained with reinforcement learning at the upper level, and random search at the lower level. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
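
The lower-level step described above, random search over pipelines for a fixed cluster assignment, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the three single-step preprocessors, the names `apply_choice`, `random_search`, and `toy_score`, and the scoring rule are all assumptions made for the example. In particular, the paper searches over multi-step pipelines and scores them by downstream model performance, while the cluster assignment (fixed here) would come from the upper-level deep clustering network trained with reinforcement learning.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Column = List[float]


def identity(col: Sequence[float]) -> Column:
    """Leave the feature unchanged."""
    return list(col)


def min_max(col: Sequence[float]) -> Column:
    """Rescale the feature to the range [0, 1]."""
    lo, hi = min(col), max(col)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in col]


def standardize(col: Sequence[float]) -> Column:
    """Shift to zero mean and scale to unit (population) variance."""
    mean = sum(col) / len(col)
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [0.0 if std == 0 else (v - mean) / std for v in col]


# Hypothetical candidate set; each "pipeline" is a single step for brevity.
PREPROCESSORS: Dict[str, Callable[[Sequence[float]], Column]] = {
    "identity": identity,
    "min_max": min_max,
    "standardize": standardize,
}


def apply_choice(
    columns: Sequence[Sequence[float]],
    clusters: Sequence[int],
    choice: Dict[int, str],
) -> List[Column]:
    """Apply to each feature the preprocessor chosen for its cluster."""
    return [
        PREPROCESSORS[choice[clusters[j]]](col)
        for j, col in enumerate(columns)
    ]


def random_search(
    columns: Sequence[Sequence[float]],
    clusters: Sequence[int],
    score: Callable[[List[Column]], float],
    trials: int,
    seed: int = 0,
) -> Tuple[Dict[int, str], float]:
    """Sample one preprocessor per cluster and keep the best-scoring choice."""
    rng = random.Random(seed)
    names = sorted(PREPROCESSORS)
    cluster_ids = sorted(set(clusters))
    best_choice: Dict[int, str] = {}
    best_score = float("-inf")
    for _ in range(trials):
        choice = {c: rng.choice(names) for c in cluster_ids}
        current = score(apply_choice(columns, clusters, choice))
        if current > best_score:
            best_choice, best_score = choice, current
    return best_choice, best_score


def toy_score(columns: List[Column]) -> float:
    """Stand-in objective (an assumption): prefer small absolute values.

    A real system would train and validate a classifier here instead.
    """
    return -max(abs(v) for col in columns for v in col)
```

Calling `random_search(columns, clusters, toy_score, trials=30)` with `clusters = [0, 0, 1]` samples only two decisions per trial, one per cluster, regardless of how many features fall into each cluster. That is the source of the search-space reduction the abstract describes.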

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.14329/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/2302.14329/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/2302.14329/full.md

---
Source: https://tomesphere.com/paper/2302.14329