Large Language Models Engineer Too Many Simple Features For Tabular Data
Jaris Küken, Lennart Purucker, Frank Hutter

TL;DR
This paper investigates biases in large language models used for tabular data feature engineering, revealing a tendency towards simple operators that can hinder performance and calling for bias mitigation strategies.
Contribution
It introduces a method to detect biases in LLM-generated features and evaluates four models across multiple datasets, highlighting the bias towards simple operators and its impact on predictive accuracy.
Findings
LLMs favor simple operators like addition
Bias towards simple features can reduce model performance
Detection method reveals operator usage biases in LLMs
Abstract
Tabular machine learning problems often require time-consuming and labor-intensive feature engineering. Recent efforts have focused on using large language models (LLMs) to capitalize on their potential domain knowledge. At the same time, researchers have observed ethically concerning negative biases in other LLM-related use cases, such as text generation. These developments motivated us to investigate whether LLMs exhibit a bias that negatively impacts the performance of feature engineering. While not ethically concerning, such a bias could hinder practitioners from fully utilizing LLMs for automated data science. Therefore, we propose a method to detect potential biases by detecting anomalies in the frequency of operators (e.g., adding two features) suggested by LLMs when engineering new features. Our experiments evaluate the bias of four LLMs, two big frontier and two small…
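The abstract's idea of detecting anomalies in operator frequency can be illustrated with a minimal sketch. This is not the authors' implementation: the operator vocabulary, the representation of each suggested feature as an (operator, operands) pair, the uniform baseline, and the `ratio` threshold are all assumptions made for illustration, and the example suggestions are made up.

```python
from collections import Counter

# Hypothetical operator vocabulary; the paper's actual operator set is not given here.
OPERATORS = ["add", "subtract", "multiply", "divide", "log", "groupby_mean"]


def operator_frequencies(suggestions, operators=OPERATORS):
    """Return the relative frequency of each known operator among the suggestions."""
    counts = Counter(op for op, _ in suggestions)
    total = sum(counts[op] for op in operators)
    if total == 0:
        raise ValueError("no suggestions use a known operator")
    return {op: counts[op] / total for op in operators}


def flag_anomalies(freqs, ratio=2.0):
    """Flag operators used at least `ratio` times more (or less) often than
    a uniform baseline would predict. The uniform baseline is an assumption."""
    expected = 1.0 / len(freqs)
    flagged = {}
    for op, f in freqs.items():
        if f >= ratio * expected:
            flagged[op] = "over-used"
        elif f <= expected / ratio:
            flagged[op] = "under-used"
    return flagged


# Made-up suggestions for illustration only (not data from the paper).
suggestions = [
    ("add", ("age", "income")),
    ("add", ("height", "weight")),
    ("add", ("a", "b")),
    ("add", ("c", "d")),
    ("subtract", ("income", "debt")),
    ("multiply", ("price", "quantity")),
]
freqs = operator_frequencies(suggestions)
flags = flag_anomalies(freqs)
```

With these invented inputs, `add` accounts for four of six suggestions and is flagged as over-used, while operators that never appear are flagged as under-used. A real analysis would likely need a more careful baseline than a uniform distribution, plus a significance test.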
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
It is good to see more examples of evaluation of downstream LLM tasks "in the wild". I appreciate that the authors were rigorous in removing datasets that were thought to be memorized or in the training data of the LLM, even though they did not have access to the training data itself.
To me, this doesn’t seem like an influential enough contribution. Not only is it tackling a very narrow problem, but it is also only evaluating a specific method for addressing that problem. While there is some prior work around using LLMs for feature engineering, I’m not convinced that this work’s method for feature engineering is necessarily representative of all the methods for using LLMs for this task. Specifically, the authors only use one prompting strategy, on a snapshot of models at th
The authors experimentally show the limitations of LLMs for feature engineering. The experimental setting is convincing.
1. The conclusions of the paper are along expected lines and are not surprising. A more notable contribution would be to address the limitations.
2. The statistical significance of the results is not provided.
3. The term "bias" is too strong for the problem explored. The authors can use the word "limitation".
The paper is based on solid experimental work, testing several LLMs across many datasets and testing for memorization issues separately to check for bias explicitly. The paper is an interesting and easy-to-follow read. Problematic properties of LLM solution paths for different problems are always appreciated. As we develop more and more systems that significantly rely on this tool, we must strive to understand the biases this seemingly easy fix-all solution of asking an LLM brings into o
The main issue with this paper is that it is rather unclear why the usage of LLMs for this task was explored at all. It seems that when feature engineering is done by an LLM, the downstream system's performance is worse than existing SOTA systems, and sometimes even worse than doing no feature engineering at all. Frankly, it's also not a task that I would intuitively expect LLMs to be good at, as general knowledge, common sense and language knowledge are probably not what humans would use for
The paper presents a novel investigation into LLMs' feature engineering capabilities. The authors introduce an innovative evaluation metric—operator frequency distribution—which effectively quantifies the patterns in operator selection during feature construction. This metric provides valuable insights into how feature engineering tools, particularly LLMs, exhibit preferences for certain operators under different task contexts and prompt conditions. Furthermore, the study's comprehensive evaluat
The paper's analysis lacks sufficient depth in several crucial areas. While the proposed operator frequency metric is interesting, it requires further validation in terms of: Effectiveness: There is no analysis comparing the variability and information content of features generated by simple versus complex operators. Fairness: The operator-level analysis overlooks that identical operators applied to different features can yield vastly different outcomes, making tool comparisons based solely on
Taxonomy
Topics: Topic Modeling
