Kitana: Efficient Data Augmentation Search for AutoML

Zezhou Huang; Pranav Subramaniam; Raul Castro Fernandez; Eugene Wu

arXiv:2305.10419 · cs.DB · May 18, 2023 · 1 citation

Open Access

TL;DR

Kitana is a data-centric AutoML system that efficiently searches for data augmentation opportunities from a large dataset corpus, significantly improving model quality and reducing search time and costs.

Contribution

It introduces a novel approach to data augmentation search in AutoML by leveraging a dataset corpus and a fast proxy model, outperforming existing systems in quality and efficiency.
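To illustrate why a fast proxy model makes the search cheap, here is a minimal sketch that ranks candidate feature columns by the R² of a single-feature least-squares fit. This page does not say what proxy Kitana uses, so the choice of simple linear regression, the function names `r2_simple_ols` and `rank_candidates`, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def r2_simple_ols(x: List[float], y: List[float]) -> float:
    """R^2 of a one-feature least-squares fit with intercept.

    For simple linear regression this equals the squared Pearson
    correlation between x and y, so no model needs to be stored.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column or constant target explains nothing
    return (sxy * sxy) / (sxx * syy)


def rank_candidates(
    candidates: Dict[str, List[float]], y: List[float]
) -> List[Tuple[str, float]]:
    """Score every candidate column with the cheap proxy, best first."""
    scores = [(name, r2_simple_ols(col, y)) for name, col in candidates.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Hypothetical toy data: a target and two candidate augmentation columns.
y = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "income": [10.0, 20.0, 30.0, 40.0],  # perfectly linear in y
    "noise": [5.0, 1.0, 4.0, 2.0],       # weakly related to y
}
ranking = rank_candidates(candidates, y)
```

Only the top-ranked candidates would then be passed to the expensive AutoML search, which is where the time savings in this kind of design would come from.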

Findings

01. Higher model R² scores (up to 0.66) achieved with augmentation.

02. Reduces AutoML search time by over 100 times.

03. Produces better models at a fraction of the cost.

Abstract

AutoML services provide a way for non-expert users to benefit from high-quality ML models without worrying about model design and deployment, in exchange for a charge per hour ($21.252 for VertexAI). However, existing AutoML services are model-centric: they are limited to extracting features and searching for models from the initial training data, so they are only as effective as the quality of that data. With the increasing volume of tabular data available, there is a huge opportunity for data augmentation. For instance, vertical augmentation adds predictive features, while horizontal augmentation adds examples. The augmented training data can yield much better AutoML models at a lower cost. However, existing systems either forgo augmentation opportunities, which results in poor models, or apply expensive augmentation search techniques that drain users' budgets.…
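The two augmentation types named in the abstract can be sketched on a tiny table. This is a hedged illustration only: the helper names `horizontal_augment` and `vertical_augment`, the `zip` join key, and the sample rows are hypothetical and do not come from the paper.

```python
from typing import Dict, List

Row = Dict[str, float]


def horizontal_augment(train: List[Row], extra: List[Row]) -> List[Row]:
    """Add examples: append rows whose columns match the training schema."""
    schema = set(train[0])
    return train + [r for r in extra if set(r) == schema]


def vertical_augment(train: List[Row], lookup: List[Row], key: str) -> List[Row]:
    """Add features: left-join the lookup table's columns onto each row by `key`."""
    index = {r[key]: r for r in lookup}
    out = []
    for row in train:
        merged = dict(row)  # copy so the original training rows stay untouched
        for col, val in index.get(row[key], {}).items():
            if col != key:
                merged[col] = val
        out.append(merged)
    return out


# Hypothetical training table plus two candidate augmentation tables.
train = [{"zip": 10027, "price": 3.1}, {"zip": 60637, "price": 2.4}]
more_rows = [{"zip": 94105, "price": 4.0}]
income = [{"zip": 10027, "income": 71.0}, {"zip": 60637, "income": 58.0}]

taller = horizontal_augment(train, more_rows)     # 3 rows, same columns
wider = vertical_augment(train, income, key="zip")  # 2 rows, extra "income" column
```

In this sketch, training rows with no match in the lookup table simply keep their original columns; a real system would need a policy for such missing values.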

Taxonomy

Topics: Machine Learning and Data Classification · Topic Modeling · Data Quality and Management