SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions

Ripon K. Saha; Akira Ura; Sonal Mahajan; Chenguang Zhu; Linyi Li; Yang Hu; Hiroaki Yoshida; Sarfraz Khurshid; Mukul R. Prasad

arXiv:2202.10451·cs.LG·April 21, 2022

TL;DR

SapientML is an AutoML system that learns from human-written pipelines to efficiently generate high-quality models for new datasets, outperforming existing tools on many benchmarks.

Contribution

It introduces a divide-and-conquer, three-stage program synthesis approach that leverages human-written solutions to improve AutoML pipeline generation.

Findings

01. Outperforms state-of-the-art AutoML tools on 27 out of 41 benchmarks.

02. Successfully synthesizes pipelines for large, real-world datasets.

03. Demonstrates the effectiveness of learning from human solutions in AutoML.

Abstract

Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML) by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique, SapientML, that can learn from a corpus of existing datasets and their human-written pipelines and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy, realized as a three-stage program synthesis approach that reasons on successively smaller search spaces. The first stage uses a machine-learned model to predict a set of plausible ML components to…
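The first stage described above, predicting plausible components from a corpus of human-written pipelines, can be illustrated with a rough sketch. This is not SapientML's actual model: it swaps in a simple nearest-neighbour vote, and the meta-features, corpus entries, component names, and the `predict_components` function are all hypothetical, chosen only to show how prior solutions might narrow the search space before any pipeline is run.

```python
from math import sqrt

# Hypothetical corpus: each entry pairs a dataset's meta-features with the
# components found in its human-written pipeline. Meta-features (assumed) are
# (fraction of missing cells, fraction of categorical columns, fraction of text columns).
CORPUS = [
    ((0.20, 0.50, 0.00), {"impute_missing", "one_hot_encode", "random_forest"}),
    ((0.15, 0.60, 0.00), {"impute_missing", "one_hot_encode", "gradient_boosting"}),
    ((0.00, 0.00, 1.00), {"tfidf_vectorize", "logistic_regression"}),
    ((0.00, 0.10, 0.90), {"tfidf_vectorize", "linear_svm"}),
    ((0.00, 0.00, 0.00), {"standard_scale", "logistic_regression"}),
]


def distance(a, b):
    """Euclidean distance between two meta-feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_components(meta, corpus=CORPUS, k=2, min_share=0.5):
    """Return components used by at least `min_share` of the k most similar datasets."""
    neighbours = sorted(corpus, key=lambda entry: distance(meta, entry[0]))[:k]
    votes = {}
    for _, components in neighbours:
        for component in components:
            votes[component] = votes.get(component, 0) + 1
    return sorted(c for c, n in votes.items() if n / k >= min_share)


# A new tabular dataset with missing values and many categorical columns.
tabular = predict_components((0.18, 0.55, 0.00))
# A new dataset dominated by free-text columns.
textual = predict_components((0.00, 0.05, 0.95))
print(tabular)
print(textual)
```

The point of the sketch is only that a cheap prediction step can discard most of the component space up front, so that later stages work on a much smaller set of candidates.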
