Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

Michał Pietruszka; Łukasz Borchmann; Aleksander Jędrosz; Paweł Morawiecki

arXiv:2410.23331·cs.CL·November 1, 2024

Open Access · 1 Repo · 1 Dataset · 1 Video

TL;DR

This paper introduces a benchmark to evaluate large language models' ability to generate feature engineering code in data science, assessing their practical usefulness through improvements in model performance.

Contribution

It presents a novel benchmark for LLMs to perform feature engineering tasks, enabling efficient evaluation of their data science capabilities.

Findings

01. LLMs can generate feature engineering code that improves model performance.

02. The proposed benchmark effectively assesses LLMs' data science skills.

03. State-of-the-art models show varying levels of success on the task.

Abstract

We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is provided with a dataset description in a prompt and asked to generate code transforming it. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. Through an extensive evaluation of state-of-the-art models and comparison to well-established benchmarks, we demonstrate that our proposed FeatEng benchmark can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to existing methods.
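The scoring idea in the abstract (fit a model on the original features, fit it again on the transformed features, and compare) can be sketched roughly as follows. This is an illustrative sketch, not the benchmark's actual code: the names `stump_accuracy`, `improvement`, and `add_area` are hypothetical, the score is assumed to be a plain accuracy difference (the abstract only says it is "derived from the improvement"), and a tiny decision stump stands in for XGBoost so the example runs with the standard library alone.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]


def stump_accuracy(X_train: List[Row], y_train: List[int],
                   X_test: List[Row], y_test: List[int]) -> float:
    """Stand-in learner (NOT XGBoost): best single-feature threshold rule."""
    best = None  # (train accuracy, feature index, threshold, polarity)
    for j in range(len(X_train[0])):
        for thr in sorted({row[j] for row in X_train}):
            for polarity in (0, 1):
                preds = [polarity if row[j] > thr else 1 - polarity
                         for row in X_train]
                acc = sum(p == t for p, t in zip(preds, y_train)) / len(y_train)
                if best is None or acc > best[0]:
                    best = (acc, j, thr, polarity)
    _, j, thr, polarity = best
    preds = [polarity if row[j] > thr else 1 - polarity for row in X_test]
    return sum(p == t for p, t in zip(preds, y_test)) / len(y_test)


def improvement(transform: Callable[[Row], Row],
                X_train: List[Row], y_train: List[int],
                X_test: List[Row], y_test: List[int],
                fit_and_score: Callable[..., float] = stump_accuracy
                ) -> Tuple[float, float, float]:
    """Return (baseline score, score after transform, difference)."""
    baseline = fit_and_score(X_train, y_train, X_test, y_test)
    modified = fit_and_score([transform(r) for r in X_train], y_train,
                             [transform(r) for r in X_test], y_test)
    return baseline, modified, modified - baseline


def add_area(row: Row) -> Row:
    """Example of generated feature engineering: append width * height."""
    width, height = row
    return (width, height, width * height)


# Toy dataset: the label depends on the product of the two raw columns,
# which a single-threshold rule on either raw column cannot capture.
points = [(w, h) for w in range(1, 7) for h in range(1, 7)]
labels = [int(w * h >= 10) for w, h in points]
is_train = [(w + h) % 2 == 0 for w, h in points]

X_train = [p for p, tr in zip(points, is_train) if tr]
y_train = [y for y, tr in zip(labels, is_train) if tr]
X_test = [p for p, tr in zip(points, is_train) if not tr]
y_test = [y for y, tr in zip(labels, is_train) if not tr]

baseline, modified, gain = improvement(add_area, X_train, y_train,
                                       X_test, y_test)
print(f"baseline={baseline:.3f} modified={modified:.3f} gain={gain:.3f}")
```

To approximate the setup described in the abstract more closely, one could pass a different `fit_and_score` callable that trains an XGBoost model and returns its held-out score; the comparison logic in `improvement` would stay the same.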

Code & Models

Repositories

FeatEng/FeatEng
Official

Datasets

FeatEng/Benchmark
dataset · 20 dl

Videos

Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

Taxonomy

Topics: Research Data Management Practices · Scientific Computing and Data Management · Semantic Web and Ontologies