Automating biomedical data science through tree-based pipeline optimization
Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, Jason H. Moore

TL;DR
This paper presents TPOT, a tree-based pipeline optimization tool that automates machine learning pipeline design, achieving competitive accuracy and discovering novel operators on genetic data, while addressing overfitting challenges.
Contribution
Introduction of TPOT, a novel automated pipeline optimization method using tree-based algorithms for machine learning tasks.
Findings
TPOT achieves competitive classification accuracy.
TPOT discovers novel pipeline operators like synthetic feature constructors.
Pipeline overfitting remains a challenge to address.
Abstract
Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning---pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators---such as synthetic feature constructors---that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and…
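To make the idea of a tree-shaped pipeline with a synthetic feature constructor concrete, here is a minimal toy sketch in plain Python. It is not TPOT's implementation or API. The data generator, the `"product"` operator, the nearest-centroid classifier, and every function name are assumptions made for illustration. The search is a plain enumeration of a few candidate trees, swapped in for the evolutionary search a real tool would use. Candidates are scored on held-out data, which is one simple guard against the overfitting the abstract mentions.

```python
import random
from itertools import combinations

def make_data(n, rng):
    """Toy data (assumption): the label depends only on the interaction
    of x0 and x1; x2 is pure noise."""
    rows, labels = [], []
    for _ in range(n):
        x = [rng.choice([-1.0, 1.0]), rng.choice([-1.0, 1.0]), rng.uniform(-1.0, 1.0)]
        rows.append(x)
        labels.append(1 if x[0] * x[1] > 0 else 0)
    return rows, labels

def transform(tree, rows):
    """Recursively apply a pipeline tree. Leaves are the string "input";
    the only internal node here is ("product", i, j, child), a hypothetical
    synthetic feature constructor that appends column i times column j."""
    if tree == "input":
        return [list(r) for r in rows]
    op, i, j, child = tree
    if op != "product":
        raise ValueError(f"unknown operator: {op}")
    return [r + [r[i] * r[j]] for r in transform(child, rows)]

def centroid_fit(rows, labels):
    """Nearest-centroid classifier: store the mean feature vector per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def centroid_predict(centroids, row):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def holdout_accuracy(tree, train, valid):
    """Fit on the training split, score on the validation split."""
    (x_tr, y_tr), (x_va, y_va) = train, valid
    model = centroid_fit(transform(tree, x_tr), y_tr)
    preds = [centroid_predict(model, r) for r in transform(tree, x_va)]
    return sum(p == y for p, y in zip(preds, y_va)) / len(y_va)

rng = random.Random(0)
train, valid = make_data(200, rng), make_data(200, rng)

# Candidate pipelines: the raw input, plus one product feature per column pair.
candidates = ["input"] + [("product", i, j, "input") for i, j in combinations(range(3), 2)]
scores = {tree: holdout_accuracy(tree, train, valid) for tree in candidates}
best_tree = max(scores, key=scores.get)
print(best_tree, scores[best_tree])
```

On this toy problem the raw features carry almost no linear signal, so the pipeline that constructs the `x0 * x1` feature should score far higher than the baseline. An evolutionary search would explore such trees by mutation and crossover rather than by listing them all.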
Taxonomy
Topics: Evolutionary Algorithms and Applications · Machine Learning and Data Classification · Advanced Multi-Objective Optimization Algorithms
