Kitana: Efficient Data Augmentation Search for AutoML
Zezhou Huang, Pranav Subramaniam, Raul Castro Fernandez, Eugene Wu

TL;DR
Kitana is a data-centric AutoML system that efficiently searches for data augmentation opportunities from a large dataset corpus, significantly improving model quality and reducing search time and costs.
Contribution
It introduces a novel approach to data augmentation search in AutoML by leveraging a dataset corpus and a fast proxy model, outperforming existing systems in quality and efficiency.
Findings
Augmentation yields higher model R² scores (up to 0.66).
Reduces AutoML search time by more than 100×.
Produces better models at a fraction of the cost.
Abstract
AutoML services provide a way for non-expert users to benefit from high-quality ML models without worrying about model design and deployment, in exchange for a charge per hour ($21.252 for VertexAI). However, existing AutoML services are model-centric, in that they are limited to extracting features and searching for models from the initial training data, so they are only as effective as the quality of that data. With the increasing volume of tabular data available, there is a huge opportunity for data augmentation. For instance, vertical augmentation adds predictive features, while horizontal augmentation adds examples. This augmented training data yields potentially much better AutoML models at a lower cost. However, existing systems either forgo the augmentation opportunities, which results in poor models, or apply expensive augmentation search techniques that drain users' budgets.…
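The two augmentation types named in the abstract can be illustrated with a minimal sketch. This is not Kitana's implementation; the function names, the `zip` join key, and the toy rental data are all hypothetical and chosen only to show that vertical augmentation widens a table with new columns while horizontal augmentation lengthens it with new rows.

```python
# Illustrative only: tables are lists of dicts; all names and data are hypothetical.

def vertical_augment(train, candidate, key):
    """Left-join candidate columns onto train by `key` (adds features)."""
    lookup = {row[key]: row for row in candidate}
    # Columns the candidate table would contribute beyond what train already has.
    new_cols = [c for c in candidate[0] if c != key and c not in train[0]]
    out = []
    for row in train:
        match = lookup.get(row[key], {})
        # Rows with no join partner get None for the new features.
        out.append({**row, **{c: match.get(c) for c in new_cols}})
    return out


def horizontal_augment(train, candidate):
    """Append candidate rows that share train's schema (adds examples)."""
    columns = set(train[0])
    return train + [row for row in candidate if set(row) == columns]


train = [
    {"zip": "10027", "sqft": 700, "price": 3100},
    {"zip": "60637", "sqft": 900, "price": 2100},
]
income = [{"zip": "10027", "median_income": 72000}]  # candidate feature table
more_rows = [
    {"zip": "94110", "sqft": 800, "price": 3600},  # same schema: kept
    {"zip": "02139", "sqft": 650},                  # missing target: skipped
]

wider = vertical_augment(train, income, "zip")
longer = horizontal_augment(train, more_rows)
```

In a search setting, each candidate table in a corpus would be tried this way and the resulting training set scored; how Kitana scores and prunes candidates is described in the paper itself and is not reproduced here.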
Taxonomy
Topics: Machine Learning and Data Classification · Topic Modeling · Data Quality and Management
