TL;DR
DiffPrep is a differentiable method that automatically searches for a data preprocessing pipeline for tabular data, improving model accuracy while reducing the computational cost of the search.
Contribution
It formalizes data preprocessing pipeline search as a bi-level optimization problem and proposes a differentiable method that enables efficient gradient-based search.
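A generic bi-level formulation of this kind can be written as follows. This is a sketch of the standard form, not the paper's exact notation: the symbols $p$ (a pipeline), $\mathcal{P}$ (the pipeline search space), $w$ (model parameters), and the use of separate training and validation losses are assumptions made here for illustration.

```latex
\min_{p \in \mathcal{P}} \; \mathcal{L}_{\mathrm{val}}\bigl(w^{*}(p),\, p\bigr)
\quad \text{s.t.} \quad
w^{*}(p) = \arg\min_{w} \; \mathcal{L}_{\mathrm{train}}(w,\, p)
```

The inner problem trains the model for a fixed pipeline, and the outer problem picks the pipeline whose trained model performs best. Solving the inner problem from scratch for every candidate pipeline is what makes a naive search expensive.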
Findings
Achieved the best test accuracy on 15 of 18 datasets
Improved test accuracy by up to 6.6 percentage points
Reduced the need to train the model multiple times during pipeline search
Abstract
Data preprocessing is a crucial step in the machine learning process that transforms raw data into a more usable format for downstream ML models. However, it can be costly and time-consuming, often requiring the expertise of domain experts. Existing automated machine learning (AutoML) frameworks claim to automate data preprocessing. However, they often use a restricted search space of data preprocessing pipelines which limits the potential performance gains, and they are often too slow as they require training the ML model multiple times. In this paper, we propose DiffPrep, a method that can automatically and efficiently search for a data preprocessing pipeline for a given tabular dataset and a differentiable ML model such that the performance of the ML model is maximized. We formalize the problem of data preprocessing pipeline search as a bi-level optimization problem. To solve this…
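To make the idea of a differentiable pipeline search concrete, here is a minimal sketch of one common way to relax a discrete choice among preprocessing operators: a softmax-weighted mixture of candidate transformations. This is a generic continuous relaxation and an assumption on our part; the excerpt above does not confirm that DiffPrep uses exactly this construction. The operator set, the `alpha` parameters, and all function names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical candidate operators for one numeric feature column.
def identity(x):
    return x

def standardize(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def min_max(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

CANDIDATES = [identity, standardize, min_max]

def relaxed_transform(x, alpha):
    # Instead of picking one operator (a discrete choice), output a
    # convex combination of all candidates. The result is a smooth
    # function of alpha, so alpha can be updated by gradient descent.
    weights = softmax(alpha)
    outputs = np.stack([op(x) for op in CANDIDATES])  # (n_ops, n_samples)
    return weights @ outputs

def discretize(alpha):
    # After the search, keep only the operator with the largest logit.
    return CANDIDATES[int(np.argmax(alpha))]

x = np.array([1.0, 2.0, 3.0, 4.0])
alpha = np.zeros(len(CANDIDATES))   # uniform weights at initialization
mixed = relaxed_transform(x, alpha)
chosen = discretize(np.array([0.0, 5.0, 0.0]))
```

In a full search, `alpha` would be optimized jointly with the model parameters in an autodiff framework, which is how a relaxation like this can avoid retraining the model once per candidate pipeline.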
