Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning

Jaehyun Nam; Kyuyoung Kim; Seunghyuk Oh; Jihoon Tack; Jaehyung Kim; Jinwoo Shin

arXiv:2406.08527 · cs.LG · November 19, 2024 · 3 cites

TL;DR

This paper introduces OCTree, a novel framework that leverages large language models and decision tree reasoning to improve feature generation in tabular data, surpassing existing automated methods.

Contribution

It presents a new LLM-based feature engineering approach that uses decision trees for iterative feedback, eliminating the need for predefined search spaces and enhancing model performance.

Findings

01. OCTree outperforms existing automated feature engineering methods.

02. The framework improves prediction accuracy across diverse benchmarks.

03. Decision tree reasoning effectively guides feature generation.

Abstract

In tabular prediction tasks, tree-based models combined with automated feature engineering methods often outperform deep learning approaches that rely on learned representations. While these feature engineering techniques are effective, they typically depend on a pre-defined search space and primarily use validation scores for feature selection, thereby missing valuable insights from previous experiments. To address these limitations, we propose a novel tabular learning framework that utilizes large language models (LLMs), termed Optimizing Column feature generator with decision Tree reasoning (OCTree). Our key idea is to leverage the reasoning capabilities of LLMs to identify effective feature generation rules without manually specifying the search space and provide language-based reasoning information highlighting past experiments as feedback for iterative rule improvements. We use…
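
The loop the abstract describes (propose a feature generation rule, score it on validation data, and feed a textual summary of the fitted tree back as context for the next proposal) can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: `propose_rule` is a hypothetical stand-in for the LLM call and simply walks a fixed candidate list, the "tree" is a depth-1 stump written in plain Python, each candidate is scored independently against the baseline, and the data is a toy integer grid invented for the example.

```python
# Toy data (an assumption for illustration): the label depends on x0 * x1.
points = [(i, j) for i in range(1, 11) for j in range(1, 11)]
train = [p for p in points if (p[0] + p[1]) % 2 == 0]
val = [p for p in points if (p[0] + p[1]) % 2 == 1]
NAMES = ["x0", "x1", "new"]

def label(p):
    return int(p[0] * p[1] > 25)

def fit_stump(rows, labels):
    """Depth-1 tree: choose (feature, threshold, flip) with the best training accuracy."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            acc = sum((r[j] > t) == bool(y) for r, y in zip(rows, labels)) / len(labels)
            for flip, a in ((False, acc), (True, 1 - acc)):
                if best is None or a > best[0]:
                    best = (a, j, t, flip)
    return best[1:]

def predict(stump, row):
    j, t, flip = stump
    return int((row[j] > t) != flip)

def describe(stump):
    """Render the fitted stump as text, the language-based feedback in this sketch."""
    j, t, flip = stump
    hi, lo = (0, 1) if flip else (1, 0)
    return f"if {NAMES[j]} > {t} then class {hi} else class {lo}"

def evaluate(rule=None):
    """Fit on train, score on validation; optionally append one generated column."""
    def augment(rows):
        return [r + (rule(r),) for r in rows] if rule else rows
    stump = fit_stump(augment(train), [label(p) for p in train])
    hits = sum(predict(stump, r) == label(p) for r, p in zip(augment(val), val))
    return hits / len(val), describe(stump)

# Hypothetical candidate rules; a real system would obtain these from an LLM.
CANDIDATES = [
    ("x0 + x1", lambda r: r[0] + r[1]),
    ("x0 - x1", lambda r: r[0] - r[1]),
    ("x0 * x1", lambda r: r[0] * r[1]),
]

def propose_rule(history):
    """Stand-in for the LLM: return the next untried rule, or None when exhausted.
    An actual prompt would include `history` (rules, scores, tree text) as feedback."""
    tried = {h["rule"] for h in history}
    return next(((n, f) for n, f in CANDIDATES if n not in tried), None)

baseline, _ = evaluate()
history = []
while (proposal := propose_rule(history)) is not None:
    name, fn = proposal
    score, tree = evaluate(fn)
    history.append({"rule": name, "score": score, "tree": tree})

best = max(history, key=lambda h: h["score"])
print(f"baseline={baseline:.2f}", best)
```

In this toy setup the product rule is the only candidate that separates the validation set perfectly, so it is selected over the raw-feature baseline. The point of keeping `history` is that both the score and the tree text are available to whatever proposes the next rule.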

Code & Models

Repositories

jaehyun513/octree (Official)

Videos

Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies