Tabular Feature Discovery With Reasoning Type Exploration
Sungwon Han, Sungkyu Park, Seungeon Lee

TL;DR
This paper introduces REFeat, a novel method that guides large language models to generate diverse, meaningful features for tabular data by leveraging multiple reasoning types, improving predictive accuracy and feature diversity.
Contribution
REFeat is the first approach to incorporate multiple reasoning paradigms to steer LLM-based feature discovery for tabular data.
Findings
Achieves higher average predictive accuracy on 59 datasets
Discovers more diverse and meaningful features
Demonstrates the effectiveness of reasoning-guided feature generation
Abstract
Feature engineering for tabular data remains a critical yet challenging step in machine learning. Recently, large language models (LLMs) have been used to automatically generate new features by leveraging their vast knowledge. However, existing LLM-based approaches often produce overly simple or repetitive features, partly due to inherent biases in the transformations the LLM chooses and the lack of structured reasoning guidance during generation. In this paper, we propose a novel method REFeat, which guides an LLM to discover diverse and informative features by leveraging multiple types of reasoning to steer the feature generation process. Experiments on 59 benchmark datasets demonstrate that our approach not only achieves higher predictive accuracy on average, but also discovers more diverse and meaningful features. These results highlight the promise of incorporating rich reasoning…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper is written with running examples. The core technical part is clear and easy to follow. Automated feature engineering is an important problem in data science. The idea of leveraging an LLM to generate features based on dataset information is reasonable, and performing the appropriate data engineering operations can speed up the cycle.
For method design, the novelty is limited. Using reasoning (especially with prompting only) for feature generation is not a new topic. This work seems to be more of a combination of existing reasoning prompting paradigm with a controlling mechanism, not a fundamentally new mechanism for reasoning itself. The generation process, which heavily depends on the LLM's internal generation and reasoning variability, would be hard to reproduce and explain. Further, the method lacks any optimization or se
(1) Innovative and Theoretically Grounded Design: The paper is the first to systematically incorporate six classical cognitive reasoning types into LLM-driven tabular feature engineering. This breaks the limitation of "generic prompting" in existing methods, providing structured logical guidance for feature generation and enriching the theoretical application of LLMs in structured data tasks. By framing reasoning type selection as a multi-armed bandit problem, the framework dynamically balances ex
(1) Computational Overhead and Efficiency Trade-Offs: REFeat requires 20 iterations of LLM calls and model evaluations to generate features, resulting in higher computational costs compared to one-shot baseline methods (e.g., CAAFE). The paper does not discuss how to balance efficiency and performance, limiting its applicability to small-sample or real-time scenarios. Conduct a performance-efficiency curve analysis: Test iterations of 5, 10, 15, and 20 to identify the optimal iteration threshold a
It is a meaningful task to study the impact of prompt design on the performance of LLM-based AutoFE, and agentic systems in general, which may offer insights on the best practice when designing such systems. The methodology part of the paper is presented clearly. Adaptively selecting reasoning strategies benefits the generation of diverse features that may suit different datasets. I appreciate the efforts the authors have made on conducting experiments across 59 OpenML datasets and evaluating di
While the motivation and methodology are clear, the technical novelty of this paper is somewhat limited. Automated prompt engineering approaches such as [1] may give more flexibility and even better performance. As a core of the work, the bandit design needs further exploration. Will it benefit from a different decay schedule of the exploration probability or a different algorithm like UCB [2]? It would be interesting to see further studies on this. For the experiments, the number of augmented
The idea proposed in the work seems very interesting, and combining prompt types and prompt selection introduces a novel optimization concept into LLM-based automated feature engineering. In general, it appears to be a significant methodological change compared to prior work. The motivation for the idea is valid, and bandit-based approaches can be promising. The paper is well-structured and easy to understand, thanks to its clear writing. The related work covers many of the relevant papers.
The core weakness of the paper is its experimental design, which I highly doubt will generalize to real-world applications of the compared methods. There are several reasons (detailed below) for this, most of which are known in the general literature on tabular data. Sadly, the literature on automated feature engineering often ignores this. As a result, **I want to highlight that the quality of the experimental design is in line with prior work on automated feature engineering**, which other rev
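The reviews above refer to framing reasoning-type selection as a multi-armed bandit, to the decay schedule of the exploration probability, and to UCB as a possible alternative. The sketch below illustrates those two selection rules in general terms; it is not the authors' implementation. The arm labels, the `eps0` and `decay` values, the exponential decay form, and the simulated reward are all assumptions made for illustration. Only the count of six reasoning types and the 20-iteration budget come from the review text.

```python
import math
import random

# Hypothetical arm labels: the reviews mention six reasoning types but do not name them.
REASONING_TYPES = [
    "deductive", "inductive", "abductive",
    "analogical", "causal", "counterfactual",
]


class EpsilonGreedyBandit:
    """Epsilon-greedy arm selection with an exponentially decaying exploration rate."""

    def __init__(self, arms, eps0=0.5, decay=0.9, seed=0):
        self.arms = list(arms)
        self.eps0 = eps0      # assumed initial exploration probability
        self.decay = decay    # assumed per-step decay factor
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}
        self.step = 0

    def epsilon(self):
        # Exploration probability after `step` updates.
        return self.eps0 * (self.decay ** self.step)

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best running mean.
        if self.rng.random() < self.epsilon():
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.means[a])

    def update(self, arm, reward):
        # Incremental update of the running mean reward for the chosen arm.
        self.step += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def ucb1_select(counts, means, c=2.0):
    """UCB1 rule: try every arm once, then maximise mean plus confidence bonus."""
    for arm, n in counts.items():
        if n == 0:
            return arm
    total = sum(counts.values())
    return max(
        counts,
        key=lambda a: means[a] + math.sqrt(c * math.log(total) / counts[a]),
    )


def run(n_iters=20, seed=0):
    # Simulated reward only. In a real pipeline the reward would be something like
    # the validation-score change after adding the generated feature (an assumption).
    rng = random.Random(seed)
    true_gain = {a: 0.01 * i for i, a in enumerate(REASONING_TYPES)}
    bandit = EpsilonGreedyBandit(REASONING_TYPES, seed=seed)
    history = []
    for _ in range(n_iters):
        arm = bandit.select()
        reward = true_gain[arm] + rng.gauss(0.0, 0.005)
        bandit.update(arm, reward)
        history.append(arm)
    return bandit, history
```

Swapping `bandit.select()` for `ucb1_select(bandit.counts, bandit.means)` inside `run` would give the UCB variant one reviewer asks about, which replaces random exploration with a count-based bonus.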
Taxonomy
Topics: Data Management and Algorithms · Semantic Web and Ontologies · Web Data Mining and Analysis
