DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data

Peng Li; Zhiyi Chen; Xu Chu; Kexin Rong

arXiv:2308.10915 · cs.DB · August 23, 2023

TL;DR

DiffPrep introduces a differentiable approach to automatically search for optimal data preprocessing pipelines for tabular data, significantly improving model accuracy with reduced computational cost.

Contribution

The paper formalizes data preprocessing pipeline search as a bi-level optimization problem and proposes a differentiable method that enables efficient gradient-based search.
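To make the idea of a differentiable pipeline search more concrete, the sketch below shows one common way to relax a discrete operator choice: replace "pick one operator" with a softmax-weighted mixture of all candidates, so the choice is controlled by continuous logits. This is a generic illustration, not the authors' implementation. The candidate operators, the function names, and the `logits` parameter are all assumptions introduced here. In a bi-level setup, such logits would typically be updated on a validation loss in the outer level while the model weights are trained in the inner level.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operators for a single numeric column.
def identity(col):
    return list(col)


def min_max_scale(col):
    lo, hi = min(col), max(col)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in col]


def standardize(col):
    mean = sum(col) / len(col)
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / std if std else 0.0 for v in col]


CANDIDATES = [identity, min_max_scale, standardize]


def relaxed_transform(col, logits):
    """Softmax-weighted mixture of every candidate's output.

    The result varies smoothly with `logits`, which is what makes a
    gradient-based search over the operator choice possible.
    """
    weights = softmax(logits)
    outputs = [op(col) for op in CANDIDATES]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(col))
    ]


def discretize(logits):
    """After the search, keep only the operator with the largest logit."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return CANDIDATES[best]
```

With equal logits the mixture averages all three operators. As one logit grows, the output approaches that single operator's result, and `discretize` recovers a conventional pipeline step once the search is done.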

Findings

01. Achieves the best test accuracy on 15 of 18 datasets.

02. Improves test accuracy by up to 6.6 percentage points.

03. Reduces the need to train the model multiple times during pipeline search.

Abstract

Data preprocessing is a crucial step in the machine learning process that transforms raw data into a more usable format for downstream ML models. However, it can be costly and time-consuming, often requiring the expertise of domain experts. Existing automated machine learning (AutoML) frameworks claim to automate data preprocessing. However, they often use a restricted search space of data preprocessing pipelines which limits the potential performance gains, and they are often too slow as they require training the ML model multiple times. In this paper, we propose DiffPrep, a method that can automatically and efficiently search for a data preprocessing pipeline for a given tabular dataset and a differentiable ML model such that the performance of the ML model is maximized. We formalize the problem of data preprocessing pipeline search as a bi-level optimization problem. To solve this…

Code & Models

Repositories

chu-data-lab/diffprep
PyTorch · Official
