Human-LLM Collaborative Feature Engineering for Tabular Data
Zhuoyan Li, Aditya Bansal, Jinzhao Li, Shishuang He, Zhuoran Lu, Mutian Zhang, Qin Liu, Yiwei Yang, Swati Jain, Ming Yin, Yunyao Li

TL;DR
This paper introduces a collaborative framework combining human expertise and large language models for more effective feature engineering in tabular data, improving performance and reducing cognitive load.
Contribution
It decouples operation proposal and selection, incorporating human feedback to better estimate utility and prioritize promising feature transformations.
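To make the decoupling concrete, here is a minimal sketch of a selection step that is separate from proposal. It assumes each LLM-proposed operation already comes with a surrogate estimate of its utility (`mean`) and the uncertainty of that estimate (`std`). The upper-confidence-bound score, the interval-overlap rule for deciding when to ask a human, and all names (`Candidate`, `ucb`, `select_operation`, `ask_human`) are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    name: str    # hypothetical operation label, e.g. "log(income)"
    mean: float  # surrogate's estimated utility (assumed: validation gain)
    std: float   # surrogate's uncertainty about that estimate


def ucb(c: Candidate, beta: float = 1.0) -> float:
    """Optimistic score: estimated utility plus an exploration bonus."""
    return c.mean + beta * c.std


def select_operation(
    candidates: List[Candidate],
    beta: float = 1.0,
    ask_human: Optional[Callable[[Candidate, Candidate], Candidate]] = None,
) -> Candidate:
    """Pick one proposed operation to evaluate next.

    The proposer (an LLM in the paper) only supplies `candidates`;
    selection happens here. A human is consulted only when the top two
    candidates cannot be separated under the assumed overlap rule.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    ranked = sorted(candidates, key=lambda c: ucb(c, beta), reverse=True)
    if len(ranked) == 1:
        return ranked[0]
    best, runner_up = ranked[0], ranked[1]
    # Ambiguous when the runner-up's optimistic score exceeds the
    # leader's pessimistic score, i.e. their intervals overlap.
    ambiguous = ucb(runner_up, beta) > best.mean - beta * best.std
    if ambiguous and ask_human is not None:
        return ask_human(best, runner_up)
    return best


# Usage with made-up numbers: "c" has a low mean but high uncertainty.
proposals = [
    Candidate("ratio(a, b)", 0.10, 0.01),
    Candidate("bin(age)", 0.05, 0.01),
    Candidate("log(income)", 0.02, 0.20),
]
auto_choice = select_operation(proposals)
human_choice = select_operation(proposals, ask_human=lambda a, b: b)
```

Without a human, the sketch explores the most uncertain candidate; with the stand-in `ask_human` callback, the preference overrides that choice. Querying only on overlap is one simple way to limit how often a person is interrupted.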
Findings
Improved feature engineering performance on various datasets.
Reduced cognitive load for users during feature engineering.
Effective integration of human feedback enhances operation selection.
Abstract
Large language models (LLMs) are increasingly used to automate feature engineering in tabular learning. Given task-specific information, LLMs can propose diverse feature transformation operations to enhance downstream model performance. However, current approaches typically assign the LLM as a black-box optimizer, responsible for both proposing and selecting operations based solely on its internal heuristics, which often lack calibrated estimations of operation utility and consequently lead to repeated exploration of low-yield operations without a principled strategy for prioritizing promising directions. In this paper, we propose a human-LLM collaborative feature engineering framework for tabular learning. We begin by decoupling the transformation operation proposal and selection processes, where LLMs are used solely to generate operation candidates, while the selection is guided by…
Peer Reviews
Decision: ICLR 2026 Poster
I think the paper suggests a valid approach to LLM-based feature engineering. The use of utility models helps selectively evaluate promising features proposed by the LLM and thus can reduce the computation cost of feature evaluations. The paper also presents an approach to incorporating human preference feedback in utility modeling. The mathematical rationale of the framework is well explained.
It would be appreciated if the authors could provide further details on the setting of BNNs and the optimization algorithms for solving Equations (5) and (17). Some experimental details are missing. The standard errors across repeated runs are not provided. It is not stated how the parameters of downstream models have been selected, which could have an impact on the performance. The experiments could evaluate other LLM backbones in addition to GPT-4o. It would be great to also include a study
(1) Innovative Method Design with Clear Targets: The core idea of decoupling LLM’s "operation proposal" and "operation selection" effectively addresses the key limitation of existing LLM-powered methods (i.e., LLMs acting as black-box optimizers). By introducing explicit utility and uncertainty modeling, the framework avoids blind exploration of low-yield operations, and the selective human feedback mechanism balances the value of human expertise and cognitive cost. (2) Comprehensive Experimental V
(1) Limitation in Human Feedback Simulation: The "w/ Human" setting in the experiment uses GPT-4o to simulate human experts, rather than recruiting real domain experts for feedback. This may deviate from the actual scenario where human experts rely on domain experience to make judgments, and the authenticity of feedback needs to be further verified. (2) Insufficient Discussion on Scalability: The paper does not discuss the framework’s performance in ultra-large-scale tabular data scenarios. The BN
1. Clearly motivated framework. The authors rightly point out the conceptual limitation of existing LLM-based feature engineering approaches, which is that an LLM is used as a black-box optimizer for both proposing and selecting new features. Instead, the paper proposes using LLMs solely for generation, with the selection guided by a Bayesian neural network as a surrogate model for utility and uncertainty estimation. 2. Principled human-in-the-loop mechanism. For more accurate utility estimates,
1. Missing ablations on the backbone model. The authors primarily evaluate GPT-4o as the backbone model for LLM-based feature engineering methods. It would be useful to analyze how much the joint feature proposal and selection is an issue depending on the backbone model, e.g., comparing open vs. closed models, different model sizes, and reasoning-enabled vs. non-reasoning models. 2. Missing experiments with humans of varying levels of expertise. In the main experiments, GPT-4o is used as a simu
Taxonomy
Topics: Machine Learning and Data Classification · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
