Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists
Michał Pietruszka, Łukasz Borchmann, Aleksander Jędrosz, Paweł Morawiecki

TL;DR
This paper introduces a benchmark to evaluate large language models' ability to generate feature engineering code in data science, assessing their practical usefulness through improvements in model performance.
Contribution
It presents a novel benchmark in which LLMs perform feature engineering tasks, enabling efficient evaluation of their data science capabilities.
Findings
LLMs can generate feature engineering code that improves model performance.
The proposed benchmark effectively assesses LLMs' data science skills.
State-of-the-art models show varying levels of success on the task.
Abstract
We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is provided with a dataset description in a prompt and asked to generate code transforming it. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. Through an extensive evaluation of state-of-the-art models and comparison to well-established benchmarks, we demonstrate that our proposed FeatEng benchmark can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to existing methods.
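The scoring loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. To stay dependency-free, a one-feature decision stump (`fit_stump`) stands in for XGBoost, and the score is taken as the plain difference in test accuracy, since the abstract does not state the exact metric. The names `fit_stump`, `improvement`, and `add_bmi`, as well as the toy height/weight data, are hypothetical; `add_bmi` plays the role of the feature engineering code an LLM would generate.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


def fit_stump(rows: List[Row], labels: List[int]) -> Callable[[Row], int]:
    """Stand-in learner (NOT XGBoost): the single feature, threshold and
    polarity with the best training accuracy."""
    best = (-1.0, "", 0.0, 1)  # (accuracy, feature, threshold, polarity)
    for feat in rows[0]:
        for thr in sorted({r[feat] for r in rows}):
            for pol in (1, -1):
                preds = [int(pol * (r[feat] - thr) >= 0) for r in rows]
                acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
                if acc > best[0]:
                    best = (acc, feat, thr, pol)
    _, feat, thr, pol = best
    return lambda r: int(pol * (r[feat] - thr) >= 0)


def accuracy(predict: Callable[[Row], int], rows: List[Row], labels: List[int]) -> float:
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def improvement(
    transform: Callable[[Row], Row],
    train: List[Row], y_train: List[int],
    test: List[Row], y_test: List[int],
) -> Tuple[float, float, float]:
    """Fit once on the original data and once on the transformed data,
    then report (baseline, engineered, gain) in test accuracy."""
    base = accuracy(fit_stump(train, y_train), test, y_test)
    new_train = [transform(r) for r in train]
    new_test = [transform(r) for r in test]
    engineered = accuracy(fit_stump(new_train, y_train), new_test, y_test)
    return base, engineered, engineered - base


def add_bmi(row: Row) -> Row:
    """Hypothetical LLM-generated feature: body mass index."""
    return {**row, "bmi": row["weight"] / row["height"] ** 2}


def make(pairs: List[Tuple[float, float]]) -> Tuple[List[Row], List[int]]:
    # Toy labels: 1 when BMI >= 25, so no single raw column separates the classes.
    rows = [{"height": h, "weight": w} for h, w in pairs]
    return rows, [int(w / h ** 2 >= 25) for h, w in pairs]


train, y_train = make([(1.60, 70), (1.90, 80), (1.70, 60), (1.80, 90),
                       (1.55, 55), (1.95, 100), (1.65, 75), (1.85, 75)])
test, y_test = make([(1.60, 58), (1.90, 95), (1.75, 85),
                     (1.85, 82), (1.58, 66), (1.92, 88)])

base, engineered, gain = improvement(add_bmi, train, y_train, test, y_test)
print(f"baseline={base:.2f} engineered={engineered:.2f} gain={gain:.2f}")
```

On this toy data the engineered feature lets the stand-in learner separate the classes, while neither raw column can, so the gain is positive. A real setup would swap `fit_stump` for an XGBoost model and use the benchmark's own datasets and metric.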
Taxonomy
Topics: Research Data Management Practices · Scientific Computing and Data Management · Semantic Web and Ontologies
