Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering

Noah Hollmann; Samuel Müller; Frank Hutter

arXiv:2305.03403 · cs.AI · October 2, 2023 · 6 cites

TL;DR

This paper introduces CAAFE, a novel method leveraging large language models to enhance feature engineering in AutoML by generating semantically meaningful features based on dataset descriptions, improving model performance and interpretability.

Contribution

The paper presents CAAFE, a context-aware feature engineering approach using LLMs that improves AutoML performance and interpretability by generating meaningful features from dataset descriptions.

Findings

01 CAAFE improves ROC AUC on 11 out of 14 datasets.

02 It raises mean ROC AUC from 0.798 to 0.822 across all datasets.

03 It provides interpretable textual explanations for the generated features.

Abstract

As the field of automated machine learning (AutoML) advances, it becomes increasingly important to incorporate domain knowledge into these systems. We present an approach for doing so by harnessing the power of large language models (LLMs). Specifically, we introduce Context-Aware Automated Feature Engineering (CAAFE), a feature engineering method for tabular datasets that utilizes an LLM to iteratively generate additional semantically meaningful features based on the description of the dataset. The method produces both Python code for creating new features and explanations for the utility of the generated features. Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets, boosting mean ROC AUC performance from 0.798 to 0.822 across all datasets, similar to the improvement achieved by using a random forest instead of logistic…
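
The iterative loop described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: `propose_candidates` is a hypothetical stub standing in for the LLM call, the toy height/weight dataset is invented, the rule "keep a feature only if the score strictly improves" is one plausible acceptance criterion, and `evaluate` is a deliberately crude stand-in (best single-column ROC AUC) for training and scoring a real downstream classifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


@dataclass
class Candidate:
    """One proposed feature: a name, a textual rationale, and the transform."""
    name: str
    explanation: str
    compute: Callable[[Row], float]


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(rows: List[Row], labels: List[int]) -> float:
    """Crude stand-in for 'train a model and score it': best single-column AUC."""
    best = 0.0
    for col in rows[0]:
        auc = roc_auc([r[col] for r in rows], labels)
        best = max(best, auc, 1.0 - auc)  # a column may be inversely related
    return best


def propose_candidates(description: str) -> List[Candidate]:
    """Hypothetical stub for the LLM call. A real system would prompt a model
    with the dataset description and parse the code and explanation it returns."""
    return [
        Candidate("weight_plus_height", "Sum of the two raw measurements.",
                  lambda r: r["weight_kg"] + r["height_m"]),
        Candidate("bmi", "Body mass index relates weight to height squared.",
                  lambda r: r["weight_kg"] / r["height_m"] ** 2),
    ]


def feature_loop(rows: List[Row], labels: List[int],
                 description: str) -> Tuple[List[Row], List[str], float]:
    """Try each proposed feature in turn; keep it only if the score improves."""
    rows = [dict(r) for r in rows]  # work on a copy, leave the input untouched
    best = evaluate(rows, labels)
    accepted: List[str] = []
    for cand in propose_candidates(description):
        trial = [{**r, cand.name: cand.compute(r)} for r in rows]
        score = evaluate(trial, labels)
        if score > best:  # assumed acceptance rule: strict improvement
            rows, best = trial, score
            accepted.append(cand.name)
    return rows, accepted, best


# Invented toy data: label is 1 when weight / height^2 exceeds 25.
ROWS: List[Row] = [
    {"weight_kg": 60.0, "height_m": 1.50},
    {"weight_kg": 80.0, "height_m": 1.90},
    {"weight_kg": 70.0, "height_m": 1.60},
    {"weight_kg": 75.0, "height_m": 1.85},
    {"weight_kg": 90.0, "height_m": 1.80},
    {"weight_kg": 55.0, "height_m": 1.65},
]
LABELS: List[int] = [1, 0, 1, 0, 1, 0]

if __name__ == "__main__":
    _, kept, final_score = feature_loop(ROWS, LABELS, "Predict overweight status.")
    print(kept, final_score)
```

In this sketch the candidates are plain Python callables rather than generated code strings, which avoids executing untrusted text; a system that runs LLM-written code would need to validate or sandbox it before execution.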

Code & Models

Videos

Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification