Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering
Noah Hollmann, Samuel Müller, Frank Hutter

TL;DR
This paper introduces CAAFE, a novel method leveraging large language models to enhance feature engineering in AutoML by generating semantically meaningful features based on dataset descriptions, improving model performance and interpretability.
Contribution
The paper presents CAAFE, a context-aware feature engineering approach using LLMs that improves AutoML performance and interpretability by generating meaningful features from dataset descriptions.
Findings
CAAFE improves ROC AUC on 11 out of 14 datasets.
It boosts mean ROC AUC from 0.798 to 0.822.
It provides interpretable textual explanations for the generated features.
Abstract
As the field of automated machine learning (AutoML) advances, it becomes increasingly important to incorporate domain knowledge into these systems. We present an approach for doing so by harnessing the power of large language models (LLMs). Specifically, we introduce Context-Aware Automated Feature Engineering (CAAFE), a feature engineering method for tabular datasets that utilizes an LLM to iteratively generate additional semantically meaningful features based on the description of the dataset. The method produces both Python code for creating new features and explanations for the utility of the generated features. Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets, boosting mean ROC AUC performance from 0.798 to 0.822 across all datasets, similar to the improvement achieved by using a random forest instead of logistic…
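The iterative generate-and-explain loop described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and is not the authors' implementation: `propose_features` stands in for the LLM call and returns hard-coded candidates, `evaluate` is a toy scorer (best single-column ROC AUC on the same rows) rather than a trained model with held-out validation, and the rule of keeping a feature only when the score strictly improves is an assumption, since the abstract as shown here does not state the acceptance criterion.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, float]
Candidate = Tuple[str, str, Callable[[Row], float]]  # (name, explanation, code)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(rows: List[Row], labels: Sequence[int]) -> float:
    """Toy stand-in for model training: best direction-agnostic AUC of any column."""
    best = 0.0
    for col in rows[0]:
        auc = roc_auc([r[col] for r in rows], labels)
        best = max(best, auc, 1.0 - auc)
    return best


def propose_features() -> Iterable[Candidate]:
    """Hypothetical stand-in for the LLM: yields a name, an explanation, and code."""
    yield ("height_cm", "Height in centimetres.", lambda r: r["height_m"] * 100)
    yield ("bmi", "Weight relative to squared height.",
           lambda r: r["weight_kg"] / r["height_m"] ** 2)


def feature_loop(rows: List[Row], labels: Sequence[int],
                 candidates: Iterable[Candidate]):
    """Try candidates one at a time; keep one only if the score strictly improves."""
    rows = [dict(r) for r in rows]  # work on a copy
    best = evaluate(rows, labels)
    kept: List[Tuple[str, str]] = []
    for name, explanation, fn in candidates:
        trial = [{**r, name: fn(r)} for r in rows]
        score = evaluate(trial, labels)
        if score > best:  # assumed acceptance rule
            rows, best = trial, score
            kept.append((name, explanation))
    return rows, best, kept


# Tiny made-up dataset: the label depends on weight relative to height.
data = [
    {"height_m": 1.60, "weight_kg": 80.0}, {"height_m": 1.85, "weight_kg": 105.0},
    {"height_m": 1.70, "weight_kg": 90.0}, {"height_m": 1.90, "weight_kg": 85.0},
    {"height_m": 1.65, "weight_kg": 60.0}, {"height_m": 1.75, "weight_kg": 70.0},
]
target = [1, 1, 1, 0, 0, 0]

baseline = evaluate(data, target)
final_rows, final_score, kept = feature_loop(data, target, propose_features())
for name, explanation in kept:
    print(f"kept {name}: {explanation}")
print(f"score {baseline:.3f} -> {final_score:.3f}")
```

In this toy run the rescaled height adds no information and is discarded, while the ratio feature is kept along with its explanation. A realistic version would score candidates with a real model on held-out data, and would execute LLM-generated code only in a sandbox.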
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
