Large Language Models Engineer Too Many Simple Features For Tabular Data

Jaris Küken; Lennart Purucker; Frank Hutter

arXiv:2410.17787·cs.LG·July 16, 2025

TL;DR

This paper investigates biases in large language models used for tabular data feature engineering, revealing a tendency towards simple operators that can hinder performance and calling for bias mitigation strategies.

Contribution

It introduces a method to detect biases in LLM-generated features and evaluates four models across multiple datasets, highlighting the bias towards simple operators and its impact on predictive accuracy.

Findings

01

LLMs favor simple operators like addition

02

Bias towards simple features can reduce model performance

03

Detection method reveals operator usage biases in LLMs

Abstract

Tabular machine learning problems often require time-consuming and labor-intensive feature engineering. Recent efforts have focused on using large language models (LLMs) to capitalize on their potential domain knowledge. At the same time, researchers have observed ethically concerning negative biases in other LLM-related use cases, such as text generation. These developments motivated us to investigate whether LLMs exhibit a bias that negatively impacts the performance of feature engineering. While not ethically concerning, such a bias could hinder practitioners from fully utilizing LLMs for automated data science. Therefore, we propose a method to detect potential biases by detecting anomalies in the frequency of operators (e.g., adding two features) suggested by LLMs when engineering new features. Our experiments evaluate the bias of four LLMs, two big frontier and two small…
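The abstract describes detecting bias through anomalies in how often each operator is suggested. A minimal sketch of that idea follows. It is not the authors' implementation: the operator names, the uniform reference distribution, and the `ratio` threshold are all assumptions made for illustration.

```python
from collections import Counter


def operator_shares(suggested_ops, operator_pool):
    """Return the relative frequency of each operator in the pool."""
    counts = Counter(suggested_ops)
    total = sum(counts[op] for op in operator_pool)
    if total == 0:
        raise ValueError("no suggested operator is in the pool")
    return {op: counts[op] / total for op in operator_pool}


def flag_skewed_operators(shares, reference=None, ratio=3.0):
    """Flag operators used far more or far less often than a reference share.

    Assumption: without an explicit reference, every operator is expected
    to be suggested equally often (uniform distribution).
    """
    uniform = 1.0 / len(shares)
    flagged = {}
    for op, share in shares.items():
        ref = reference[op] if reference is not None else uniform
        if share > ratio * ref:
            flagged[op] = "over-used"
        elif share < ref / ratio:
            flagged[op] = "under-used"
    return flagged


# Hypothetical operator pool and hypothetical LLM suggestions.
pool = ["add", "subtract", "multiply", "divide", "groupby_mean"]
suggested = ["add"] * 7 + ["subtract"] * 2 + ["multiply"]

shares = operator_shares(suggested, pool)
flags = flag_skewed_operators(shares)
print(shares)
print(flags)
```

In practice the reference shares could come from another feature engineering tool rather than a uniform distribution, and a statistical test could replace the fixed ratio. Both choices are left open here.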

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 3·Confidence 4

Strengths

It is good to see more examples of evaluation of downstream LLM tasks "in the wild". I appreciate that the authors were rigorous in removing datasets that were thought to be memorized or in the training data of the LLM, even though they did not have access to the training data itself.

Weaknesses

To me, this doesn’t seem like an influential enough contribution. Not only is it tackling a very narrow problem, but it is also only evaluating a specific method for addressing that problem. While there is some prior work around using LLMs for feature engineering, I’m not convinced that this work’s method for feature engineering is necessarily representative of all the methods for using LLMs for this task. Specifically, the authors only use one prompting strategy, on a snapshot of models at th

Reviewer 02·Rating 3·Confidence 5

Strengths

The authors experimentally show the limitations of LLMs for feature engineering. The experimental setting is convincing.

Weaknesses

1. The conclusions of the paper are along expected lines and are not surprising. A more notable contribution would be to address the limitations.
2. The statistical significance of the results is not provided.
3. The term "bias" is too strong for the problem explored. The authors can use the word "limitation".

Reviewer 03·Rating 5·Confidence 4

Strengths

The paper is based on solid experimental work, testing using several LLMs and across many datasets, testing for memorization issues separately to check for bias explicitly. The paper is an interesting and easy to follow read. Problematic properties of LLM solution paths for different problems are always appreciated, as we develop more and more systems that significantly rely on this tool, we must strive to understand the biases this seemingly easy fix-all solution of asking an LLM brings into o

Weaknesses

The main issue with this paper is that it is rather unclear why the usage of LLMs for this task was explored at all. It seems that when feature engineering is done by an LLM, the downstream system's performance is worse than existing SOTA systems - and sometimes even worse than doing any feature engineering at all. Frankly, it's also not a task that I would intuitively expect LLMs to be good at, as general knowledge, common sense and language knowledge is probably not what humans would use for

Reviewer 04·Rating 3·Confidence 4

Strengths

The paper presents a novel investigation into LLMs' feature engineering capabilities. The authors introduce an innovative evaluation metric—operator frequency distribution—which effectively quantifies the patterns in operator selection during feature construction. This metric provides valuable insights into how feature engineering tools, particularly LLMs, exhibit preferences for certain operators under different task contexts and prompt conditions. Furthermore, the study's comprehensive evaluat

Weaknesses

The paper's analysis lacks sufficient depth in several crucial areas. While the proposed operator frequency metric is interesting, it requires further validation in terms of: Effectiveness: There is no analysis comparing the variability and information content of features generated by simple versus complex operators. Fairness: The operator-level analysis overlooks that identical operators applied to different features can yield vastly different outcomes, making tool comparisons based solely on

Code & Models

Repositories

automl/llms_feature_engineering_bias
Official

Taxonomy

Topics·Topic Modeling