GPT-FT: An Efficient Automated Feature Transformation Using GPT for Sequence Reconstruction and Performance Enhancement

Yang Gao; Dongjie Wang; Scott Piersall; Ye Zhang; Liqiang Wang

arXiv:2508.20824·cs.LG·August 29, 2025

TL;DR

This paper introduces GPT-FT, a transformer-based framework that automates feature transformation for machine learning, achieving high performance with reduced computational costs through a novel multi-step process.

Contribution

The paper presents a new GPT-based method for automated feature transformation that improves efficiency and scalability over existing encoder-decoder approaches.

Findings

01 Matches or exceeds baseline performance on benchmarks

02 Significantly reduces computational costs

03 Enhances scalability of feature transformation processes

Tables (5)

Table 1: Comparison of overall performance. Results for binary classification are labeled "C", while "R" indicates regression tasks. The highest performance values are shown in bold, with the second-highest values underlined. Greater values signify superior performance.

Dataset	Source	C/R	Samples	Features	RDG	ERG	LDA	AFAT	NFS	TTG	GRFG	DIFER	MOAT	GPT-FT
Contraceptive Method Choice [28]	UCIrvine	C	1473	9	0.493	0.505	0.366	0.503	0.52	0.508	0.533	0.538	0.537	0.544
Heart Disease [15]	UCIrvine	C	303	13	0.851	0.850	0.763	0.834	0.834	0.868	0.831	0.841	0.866	0.867
Ozone Level Detection [45]	UCIrvine	C	2536	72	0.959	0.959	0.957	0.961	0.956	0.96	0.958	0.956	0.961	0.962
Seeds [2]	UCIrvine	C	210	7	0.969	0.971	0.736	0.971	0.971	0.965	0.926	0.957	0.971	0.977
Titanic [17]	Kaggle	C	891	11	0.814	0.829	0.736	0.818	0.82	0.814	0.82	0.825	0.831	0.832
Lymphography [49]	UCIrvine	C	148	18	0.108	0.144	0.167	0.15	0.152	0.148	0.182	0.15	0.267	0.352
Amazon Employee [31]	Kaggle	C	32769	9	0.932	0.934	0.916	0.93	0.932	0.933	0.932	0.929	0.936	0.983
Wine Quality Red [5]	UCIrvine	C	999	12	0.466	0.461	0.433	0.48	0.462	0.467	0.47	0.476	0.559	0.622
Wine Quality White [5]	UCIrvine	C	4900	12	0.524	0.510	0.449	0.516	0.525	0.507	0.534	0.507	0.536	0.544
Tecator [37]	OpenML	R	240	125	0.541	0.584	0.418	0.541	0.525	0.527	0.750	0.692	0.545	0.885
Geographical Original of Music [46]	UCIrvine	R	1059	118	0.388	0.395	0.317	0.398	0.283	0.28	0.472	0.632	0.481	0.508
Jasmine [36]	OpenML	R	2984	145	0.402	0.415	0.391	0.411	0.406	0.407	0.326	0.447	0.407	0.477
Libras move [6]	OpenML	R	360	91	0.179	0.286	0.085	0.215	0.156	0.226	0.294	0.172	0.293	0.308
Bodyfat [16]	UCIrvine	R	252	15	0.84	0.84	0.282	0.846	0.848	0.853	0.652	0.737	0.843	0.865
Weather [14]	Kaggle	R	366	12	0.969	0.971	0.838	0.975	0.975	0.973	0.96	0.914	0.976	0.98

Table 2: Robustness check of GPT-FT with distinct ML models on the Weather dataset in terms of 1-RAE score.

Weather	RF	XGB	SVM	KNN	Ridge	LASSO	DT
RDG	0.969	0.977	0.609	0.871	0.481	0.163	0.971
ERG	0.971	0.971	0.722	0.862	0.48	0.104	0.973
LDA	0.838	0.915	0.248	0.824	0.016	0.217	0.904
AFAT	0.975	0.971	0.629	0.854	0.474	0.209	0.976
NFS	0.975	0.974	0.614	0.865	0.202	0.132	0.976
TTG	0.975	0.97	0.571	0.873	0.197	0.198	0.978
GRFG	0.96	0.962	0.826	0.928	0.327	0.231	0.962
DIFER	0.914	0.906	0.712	0.9	0.461	0.217	0.905
MOAT	0.976	0.975	0.314	0.976	0.484	0.244	0.976
GPT-FT	0.98	0.98	0.831	0.978	0.493	0.251	0.981

Table 3: Robustness check of GPT-FT with distinct ML models on the Wine Quality Red dataset in terms of F1-score.

Wine Quality Red	RF	XGB	SVM	KNN	Ridge	LASSO	DT
RDG	0.466	0.591	0.568	0.530	0.561	0.575	0.522
ERG	0.461	0.574	0.570	0.561	0.557	0.576	0.515
LDA	0.433	0.564	0.537	0.493	0.537	0.537	0.535
AFAT	0.480	0.564	0.356	0.436	0.522	0.509	0.490
NFS	0.462	0.561	0.559	0.530	0.573	0.583	0.468
TTG	0.467	0.585	0.560	0.540	0.560	0.575	0.532
GRFG	0.470	0.581	0.580	0.587	0.570	0.580	0.587
DIFER	0.476	0.576	0.538	0.538	0.587	0.587	0.516
MOAT	0.616	0.595	0.507	0.526	0.591	0.586	0.559
GPT-FT	0.622	0.596	0.599	0.587	0.593	0.598	0.594

Table 4: Comparison of model parameter sizes between MOAT and GPT-FT across various datasets. Sizes are in megabytes (MB).

Dataset	Samples	Features	MOAT	GPT-FT
Contraceptive Method Choice	1473	9	0.42	0.21
Heart Disease	303	13	0.20	0.10
Ozone Level Detection	2536	72	0.64	0.31
Seeds	210	7	0.32	0.16
Titanic	891	11	0.31	0.15
Lymphography	148	18	0.17	0.08
Amazon Employee	32769	9	6.46	3.21
Wine Quality Red	999	12	0.54	0.27
Wine Quality White	4900	12	1.27	0.63
Tecator	240	125	4.81	2.39
Geographical Original of Music	1059	118	14.14	7.08
Jasmine	2984	145	0.98	0.49
Libras move	360	91	0.38	0.19
Bodyfat	252	15	0.23	0.11
Weather	366	12	0.92	0.45

Table 5: Comparison of model inference time between MOAT and GPT-FT across various datasets. Times are in seconds.

Dataset	Samples	Features	MOAT	GPT-FT
Contraceptive Method Choice	1473	9	23.83	23.30
Heart Disease	303	13	24.58	22.61
Ozone Level Detection	2536	72	34.11	27.93
Seeds	210	7	36.89	29.08
Titanic	891	11	25.59	23.56
Lymphography	148	18	27.21	24.44
Amazon Employee	32769	9	32.19	23.43
Wine Quality Red	999	12	27.35	23.94
Wine Quality White	4900	12	25.22	23.54
Tecator	240	125	67.22	39.42
Geographical Original of Music	1059	118	71.10	41.37
Jasmine	2984	145	57.57	43.02
Libras move	360	91	31.83	29.14
Bodyfat	252	15	34.67	28.35
Weather	366	12	24.12	22.79

Equations

$$\Gamma^{*}=\psi(\textbf{E}^{*})=\arg\max_{\textbf{E}}\mathcal{P}(\mathcal{Q}(\psi\{\phi[\tau(X)]\}),y)$$

Taxonomy

Topics: Algorithms and Data Compression

Full text

Yang Gao (1), Dongjie Wang (2), Scott Piersall (1), Ye Zhang (3), Liqiang Wang (1)

(1) University of Central Florida, Orlando FL 32816, USA; {yang.gao,sc382961,liqiang.wang}@ucf.edu

(2) University of Kansas, Lawrence KS 66045, USA; [email protected]

(3) Northeast Normal University, Changchun, Jilin, PRC; [email protected]

Abstract

Feature transformation plays a critical role in enhancing machine learning model performance by optimizing data representations. Recent state-of-the-art approaches address this task as a continuous embedding optimization problem, converting discrete search into a learnable process. Although effective, these methods often rely on sequential encoder-decoder structures that cause high computational costs and parameter requirements, limiting scalability and efficiency. To address these limitations, we propose a novel framework that accomplishes automated feature transformation through four steps: transformation records collection, embedding space construction with a revised Generative Pre-trained Transformer (GPT) model, gradient-ascent search, and autoregressive reconstruction. In our approach, the revised GPT model serves two primary functions: (a) feature transformation sequence reconstruction; and (b) model performance estimation and enhancement for downstream tasks by constructing the embedding space. Such a multi-objective optimization framework reduces parameter size and accelerates transformation processes. Experimental results on benchmark datasets show that the proposed framework matches or exceeds baseline performance, with significant gains in computational efficiency. This work highlights the potential of transformer-based architectures for scalable, high-performance automated feature transformation.

Keywords: Automated Feature Transformation · Generative Pre-Trained Transformer · Multi-Objective Optimization.

1 Introduction

Feature transformation is a pivotal component in machine learning pipelines, aiming to enhance model performance on downstream tasks by optimizing data representations. An effective feature transformation can significantly impact the predictive accuracy of models, especially in scenarios involving complex or high-dimensional datasets. Traditional approaches often rely on manual feature engineering, which is time-consuming and requires substantial domain expertise. This has spurred interest in automated feature transformation (AFT) methods that can systematically and efficiently explore the feature space.

The current algorithms for AFT can be broadly classified into three distinct categories: (1) Expansion-Reduction Methodologies: These approaches, such as Deep Feature Synthesis [18], AutoFeat [12], and Cognito [24], apply various mathematical operations across all features to generate a large set of potential transformed features, followed by a reduction phase to select the most valuable ones. Although these methods capture complex feature interactions, they often rely on random generation, leading to computational inefficiency and instability because of the inclusion of many redundant features. (2) Iterative-Feedback Approaches: Methods like Group Feature Generation [40], Feature Engineering Automation [33, 42], and Genetic Programming-based techniques [39] combine feature generation and selection in iterative cycles, updating strategies based on model performance feedback. While guided by evolutionary algorithms or reinforcement learning, their reliance on discrete search spaces can hinder convergence, making scaling to larger feature spaces challenging and inefficient. (3) Neural Architecture Search (NAS)-Based Approaches: Inspired by NAS, which was originally designed to automate neural network architecture design, some studies have framed AFT as a problem within the NAS paradigm [3, 48, 41, 29]. These methods treat feature transformation sequences as hyperparameters within a model structure for optimization. Although this structured formulation guides the search, it often suffers from slow speeds and large parameter sizes, limiting efficiency and scalability.

To address these limitations, we introduce a novel Generative Pre-trained Transformer framework for efficient Automated Feature Transformation (GPT-FT). Transformers provide strong sequence modeling and generation capabilities, enabling parallelization and improved parameter efficiency over traditional methods. Our framework includes four key steps: (1) Transformation Record Collection: We gather a dataset comprising feature transformation sequences and their corresponding model performance metrics. This dataset serves as the foundation for learning the relationship between transformation sequences and their impact on model performance. (2) Embedding Space Construction with revised GPT: We adopt the architecture of the GPT-1 model [35] and train it from scratch to regenerate transformation sequences. Notably, our model, GPT-FT, is significantly smaller than GPT-1 in terms of parameter size, with an embedding size of 64 compared to GPT-1’s 768. This step serves two purposes: (a) feature transformation sequence reconstruction, which learns to generate valid and effective transformation sequences in an autoregressive manner; (b) model performance estimation and optimization, which predicts the performance impact of given transformation sequences to guide the optimization process. (3) Gradient-Ascent Search: We perform optimization in the continuous embedding space constructed by our GPT-FT model. By applying gradient-ascent techniques, we efficiently search for embeddings that are likely to yield improved model performance. (4) Autoregressive Reconstruction: The optimized embeddings are decoded back into feature transformation sequences using GPT-FT’s autoregressive capabilities. This results in refined feature spaces tailored for enhanced downstream model performance.

By integrating sequence reconstruction and performance estimation/enhancement tasks within our decoder-only GPT-FT model, our approach significantly reduces parameter size and computational overhead compared to traditional encoder-decoder methods. This streamlined, decoder-only structure minimizes the parameter requirements, enhancing scalability and making the framework suitable for large-scale and real-time applications. We evaluate our framework on benchmark datasets, where it matches or surpasses state-of-the-art methods and achieves significant computational efficiency, highlighting the advantages of transformer-based architectures for automated feature transformation.

Our contributions can be summarized as follows:

• We introduce a novel framework, GPT-FT, that leverages the GPT model architecture for efficient automated feature transformation, addressing the scalability and efficiency challenges present in existing methods.

• We show the dual capability of the GPT-FT model in reconstructing transformation sequences and estimating model performance within a unified architecture, enabling effective optimization in a continuous embedding space.

• We show through extensive experiments that our framework achieves superior performance with reduced computational costs compared to state-of-the-art methods.

2 Problem Statement

Our objective is to provide a resilient, deeply differentiable system for automatic feature transformation. Considering a dataset $D=\{X,y\}$ and an operation set $\mathcal{O}$ , we develop a cascading reinforcement learning framework $\tau$ to collect training data $T=\{(\gamma_{i},v_{i})\}_{i=1}^{n}$ , where $\gamma_{i}$ denotes a sequence of feature transformations and $v_{i}$ indicates its predictive performance. Our framework concurrently optimises a mapping function $\phi$ , a reconstruction function $\psi$ , and an evaluation function $\omega$ to embed transformation sequences into a continuous space, linking each point with its corresponding sequence and performance metrics. Through gradient-based search in the embedding space, we determine the best transformation sequence $\Gamma^{*}$ , which may be expressed as follows:

$$\Gamma^{*}=\psi(\textbf{E}^{*})=\arg\max_{\textbf{E}}\mathcal{P}(\mathcal{Q}(\psi\{\phi[\tau(X)]\}),y)$$

where $\tau$ transforms the original dataset feature $X$ into $\{\gamma_{i}\}^{n}_{i=1}$ , $\phi$ maps $\{\gamma_{i}\}^{n}_{i=1}$ to a continuous embedding space, and $\psi$ reconstructs a sequence of feature transformations from any embedding point; $\textbf{E}^{*}$ denotes the optimal embedding; $\mathcal{Q}$ represents the downstream machine learning model; and $\mathcal{P}$ indicates the performance metric. Ultimately, we employ $\Gamma^{*}$ to convert X into the optimal feature space $\textbf{X}^{*}$ , therefore maximising $\mathcal{P}$ .

3 Methodology

3.1 Framework Overview

Figure 1 shows the framework of GPT-FT with four steps:

(1) Transformation Records Collection.

(2) Embedding Space Construction with a revised GPT.

(3) Gradient-Ascent Search.

(4) Autoregressive Reconstruction.

In Step 1, we collect records of feature transformation sequences and their associated model performance using an RL-based framework, as described in [40]. In Step 2, our GPT-FT model encodes knowledge from the collected feature transformation records into a continuous embedding space. To achieve this, we minimize both the feature transformation sequence reconstruction loss and the model performance estimation loss. In Step 3, we initially acquire the embeddings of the highest-ranking transformation operation sequences through the well-trained GPT-FT. Using these embeddings as initial points, we explore the gradient generated by the GPT-FT to identify optimal embeddings that enhance model performance. In Step 4, the GPT-FT based Text Predictor decodes optimal embeddings to generate candidate feature transformation sequences. These sequences are applied to the original features to construct refined feature spaces. A downstream predictive model evaluates the quality of these spaces, and the feature space with the highest performance is selected as the optimal output.

3.2 Transformation Record Collection

To automatically collect a large volume of high-quality transformation records, we employ an RL-based feature transformation framework as the data collector [40]. Specifically, the feature transformation process is modeled as three Markov Decision Processes (MDPs), handled by a head feature agent, an operation agent, and a tail feature agent. These agents work collaboratively to select candidate features and mathematical operations for generating new features. The process is optimized to maximize downstream predictive performance while minimizing feature space redundancy. During this learning phase, transformation sequences and their corresponding model performance are collected as the data $T=\{(\gamma_{i},v_{i})\}^{n}_{i=1}$ , where $\gamma_{i}$ is a feature transformation sequence, $v_{i}$ is the corresponding downstream task performance, and $n$ is the number of pairs.
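As an illustration of the data this step produces, the sketch below holds the records $T=\{(\gamma_{i},v_{i})\}$ in a small container and ranks them by performance, which is how seeds for the later search could be picked. The class name, field names, token vocabulary, and scores are hypothetical; the RL collector itself is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TransformationRecord:
    """One pair (gamma_i, v_i). Names are illustrative, not from the paper."""
    sequence: List[str]   # gamma_i: tokens of a feature transformation sequence
    performance: float    # v_i: downstream score (e.g., F1-score or 1-RAE)


def top_k(records: List[TransformationRecord], k: int) -> List[TransformationRecord]:
    """Return the k records with the highest downstream performance."""
    return sorted(records, key=lambda r: r.performance, reverse=True)[:k]


# Toy records with made-up scores.
records = [
    TransformationRecord(["f1", "f2", "+"], 0.51),
    TransformationRecord(["f1", "sqrt", "<SEP>", "f2", "f3", "*"], 0.54),
    TransformationRecord(["f3", "log"], 0.49),
]
seeds = top_k(records, k=2)
```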

3.3 Embedding Space Construction with GPT

We use GPT-FT to map the sequential information of preprocessed features into an embedding space. Each feature is represented as a pair of a transformation operation sequence and its corresponding model performance. GPT-FT produces two outputs—text-based predictions ( $\hat{\gamma}_{i}$ ) and downstream task performance ( $\hat{v}_{i}$ )—leading to two distinct training objectives.

Target 1: Learning Continuous Embeddings. The first objective is to train GPT-FT to generate continuous embeddings that effectively represent the original dataset while reducing the search space. These continuous embeddings can be explored using gradient-based optimization. To achieve this, our GPT-FT uses the single-layer Embedding Generator $\phi$ (reduced from the original 12 layers in GPT [35]). We train the Embedding Generator $\phi$ alongside the Text Predictor $\psi$ , both utilizing the same input-output pairs transformed in Step 1. The forward process is expressed as $\hat{\gamma}_{i}=\psi(\textbf{E}_{i})$ , where $\textbf{E}_{i}=\phi(\gamma_{i})\in\mathbb{R}^{L\times d}$ , and $L$ denotes the length of $\gamma_{i}$ and $\hat{\gamma}_{i}$ . For the loss function, we assume the GPT-FT output follows a probability distribution centered on the input sequence $\gamma_{i}$ with unit variance. Accordingly, we employ the Negative Log-Likelihood (NLL) loss: $\mathcal{L}_{\text{pre}}=\sum_{i=1}^{n}-\log p(\hat{\gamma}_{i}|\gamma_{i})$ , where $\hat{\gamma}_{i}$ is the GPT-FT’s text-based output, $\gamma_{i}$ is the input sequence, and $n$ is the number of feature-performance pairs as defined in Section 3.2.

Target 2: Estimating Downstream Task Performance. The second objective is to train GPT-FT to estimate the downstream task performance, enabling gradient-based guidance for subsequent search steps. Here, the ground truth is the model performance $v_{i}$ (e.g., F1-score or $1-\text{RAE}$ ) from Step 1. We train the single-layer Embedding Generator $\phi$ alongside the Task Classifier $\delta$ in GPT-FT to predict performance values, formulated as $\hat{v}_{i}=\delta(\textbf{E}_{i})\in\mathbb{R}$ . The loss function for this objective is defined as $\mathcal{L}_{\text{cls}}=\sum_{i=1}^{n}\text{MSE}(\hat{v}_{i},v_{i})$ , where $\hat{v}_{i}$ is Task Classifier $\delta$ ’s predicted performance, and $v_{i}$ is the actual performance from Step 1. Both the Text Predictor and Task Classifier are implemented as single-layer linear transformations.

Joint Training Loss $\mathcal{L}$ : We jointly optimize both objectives of the GPT-FT model. The joint training loss is $\mathcal{L}=\alpha\mathcal{L}_{\text{pre}}+(1-\alpha)\mathcal{L}_{\text{cls}}$ , where $\alpha$ is the trade-off hyperparameter that balances the sequence reconstruction loss against the performance estimation loss.
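The joint objective can be made concrete with a small worked sketch. It evaluates $\alpha\mathcal{L}_{\text{pre}}+(1-\alpha)\mathcal{L}_{\text{cls}}$ on toy numbers, treating the per-record MSE of a scalar prediction as a squared error. The function names, token probabilities, and scores are assumptions for illustration; only the formula and the value $\alpha=0.133$ come from the paper.

```python
import math


def reconstruction_loss(true_token_probs):
    """L_pre: summed negative log-likelihood of the correct tokens."""
    return sum(-math.log(p) for p in true_token_probs)


def performance_loss(predicted, actual):
    """L_cls: summed squared error between predicted and observed scores."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


def joint_loss(true_token_probs, predicted, actual, alpha):
    """alpha * L_pre + (1 - alpha) * L_cls."""
    l_pre = reconstruction_loss(true_token_probs)
    l_cls = performance_loss(predicted, actual)
    return alpha * l_pre + (1.0 - alpha) * l_cls


# Toy values (hypothetical): probabilities the model assigns to the true
# tokens, and predicted vs. observed downstream scores for two records.
token_probs = [0.9, 0.8, 0.95]
predicted_scores = [0.60, 0.55]
observed_scores = [0.62, 0.50]

loss = joint_loss(token_probs, predicted_scores, observed_scores, alpha=0.133)
```

With a small $\alpha$ such as 0.133, most of the weight falls on the performance estimation term, which matches the sensitivity analysis reported later in the paper.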

3.4 Gradient-Ascent Search

To perform optimal embedding search, we first select the top- $k$ transformation sequences ranked by downstream predictive accuracy. The trained GPT-FT maps these postfix expressions to continuous embeddings, which serve as initial points for gradient ascent. Starting from an embedding $\textbf{E}$ , the search updates as $\tilde{\textbf{E}}=\textbf{E}+\eta\frac{\partial\textbf{G}}{\partial\textbf{E}}$ , where $\tilde{\textbf{E}}$ is the refined embedding, $\eta$ is the step size, and $\textbf{G}$ represents GPT-FT. The performance satisfies $\textbf{G}(\tilde{\textbf{E}})\geq\textbf{G}(\textbf{E})$ . For $k$ seeds, the refined embeddings are $[\tilde{\textbf{E}}_{1},\tilde{\textbf{E}}_{2},\dots,\tilde{\textbf{E}}_{k}]$ .
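The update rule can be sketched with a deliberately simple stand-in for $\textbf{G}$. Here the estimator is assumed to be a linear head on the mean-pooled embedding, so its gradient is available in closed form; in the actual model the gradient would come from automatic differentiation through GPT-FT. All names and numbers below are hypothetical.

```python
def estimate(embedding, weights, bias=0.0):
    """Toy stand-in for G: linear head on the mean-pooled L x d embedding."""
    length = len(embedding)
    dim = len(weights)
    pooled = [sum(row[j] for row in embedding) / length for j in range(dim)]
    return sum(w * x for w, x in zip(weights, pooled)) + bias


def gradient(embedding, weights):
    """Analytic dG/dE for the linear estimator: each row gets weights / L."""
    length = len(embedding)
    return [[w / length for w in weights] for _ in embedding]


def ascent_step(embedding, weights, eta):
    """One update E_tilde = E + eta * dG/dE."""
    grad = gradient(embedding, weights)
    return [
        [e + eta * g for e, g in zip(row, grad_row)]
        for row, grad_row in zip(embedding, grad)
    ]


# A seed embedding with L = 2 tokens and d = 2 dimensions (made-up values).
seed = [[0.1, -0.2], [0.3, 0.4]]
weights = [0.5, -1.0]
refined = ascent_step(seed, weights, eta=0.1)
```

For this linear stand-in the estimated performance can only increase after a step, which mirrors the condition $\textbf{G}(\tilde{\textbf{E}})\geq\textbf{G}(\textbf{E})$; for a nonlinear estimator that property depends on choosing a sufficiently small step size.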

3.5 Autoregressive Reconstruction

The trained Text Predictor $\psi$ in GPT-FT reconstructs transformation sequences from the candidate embeddings $[\tilde{\textbf{E}}_{1},\tilde{\textbf{E}}_{2},\dots,\tilde{\textbf{E}}_{k}]$ as $[\tilde{\textbf{E}}_{1},\dots,\tilde{\textbf{E}}_{k}]\xrightarrow{\psi}\{\tilde{\gamma}_{i}\}_{i=1}^{k}$ . The sequence with the highest probability is selected, generating $k$ transformation sequences $\{\tilde{\gamma}_{i}\}_{i=1}^{k}$ . Each sequence is segmented by the <SEP> token, with invalid segments removed based on mathematical computability. Valid components reconstruct feature transformation sequences $\{\tilde{\Gamma}_{i}\}_{i=1}^{k}$ , which refine the feature space $\{\tilde{X}_{i}\}_{i=1}^{k}$ . The feature set yielding the highest downstream performance is selected as the optimal feature space $\textbf{X}^{*}$ .
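The segmentation and filtering step described above can be sketched as follows, assuming the decoded output is a list of postfix tokens. The operator sets and feature token names are hypothetical, and the validity check here covers only structural well-formedness of a postfix expression, not numerical issues such as division by zero.

```python
UNARY = {"sqrt", "log", "square"}   # hypothetical unary operations
BINARY = {"+", "-", "*", "/"}       # hypothetical binary operations


def split_segments(tokens, sep="<SEP>"):
    """Split a decoded token sequence into segments at each separator."""
    segments, current = [], []
    for tok in tokens:
        if tok == sep:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments


def is_valid_postfix(segment):
    """True if the segment is a well-formed postfix expression."""
    depth = 0  # number of operands currently on the evaluation stack
    for tok in segment:
        if tok in BINARY:
            if depth < 2:
                return False
            depth -= 1          # two operands consumed, one result pushed
        elif tok in UNARY:
            if depth < 1:
                return False    # one operand consumed, one result pushed
        else:
            depth += 1          # any other token is treated as a feature
    return depth == 1


tokens = ["f1", "f2", "+", "<SEP>", "f3", "*", "<SEP>",
          "f1", "sqrt", "<SEP>", "f2", "f3"]
valid = [seg for seg in split_segments(tokens) if is_valid_postfix(seg)]
```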

4 Experiment

4.1 Experimental Setup

Datasets and Evaluation Metrics We conducted experiments on 15 publicly available datasets from Kaggle [13], OpenML [34], and UCI [21], comprising nine classification and six regression tasks. Dataset statistics are summarized in Table 1. For classification tasks, we used F1-score, Precision, Recall, and ROC/AUC, while regression tasks were evaluated using 1-Relative Absolute Error (1-RAE) [40], 1-Mean Absolute Error (1-MAE), 1-Mean Square Error (1-MSE), and 1-Root Mean Square Error (1-RMSE).
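For readers unfamiliar with the regression metric, the sketch below computes 1-RAE under the standard definition of relative absolute error, i.e., the total absolute error divided by the total absolute deviation of the targets from their mean. That this matches the exact definition used in [40] is an assumption; the function name and sample values are illustrative.

```python
def one_minus_rae(y_true, y_pred):
    """1 - RAE, with RAE = sum|y - y_hat| / sum|y - mean(y)|."""
    mean_y = sum(y_true) / len(y_true)
    abs_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    abs_deviation = sum(abs(t - mean_y) for t in y_true)
    return 1.0 - abs_error / abs_deviation


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = one_minus_rae(y_true, y_true)          # perfect predictions
baseline = one_minus_rae(y_true, [2.5] * 4)      # always predict the mean
```

Under this definition a perfect predictor scores 1 and a predictor that always outputs the mean scores 0, so greater values are better, consistent with Table 1.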

Baseline Models We compared our method against nine prevalent feature generation techniques: (1) RDG generates transformation records of feature-operation-feature randomly to create a new feature space. (2) ERG applies operations to each feature to expand the feature space, then selects essential features as the new feature set. (3) LDA [1] employs matrix factorization to derive hidden states as the generated feature space. (4) AFAT [12] improves upon ERG by iteratively generating new features and using multi-step feature selection to identify informative ones. (5) NFS [3] models transformation sequences for each feature and optimizes feature generation using reinforcement learning. (6) TTG [23] conceptualizes the transformation process as a graph and applies reinforcement learning to search for the optimal feature set. (7) GRFG [40] uses three collaborative reinforced agents for feature generation and introduces feature grouping to improve learning efficiency. (8) DIFER [48] employs a seq2seq model to embed randomly generated feature transformations and applies gradient search to identify optimal features. (9) MOAT [41] uses an embedding-optimization-reconstruction framework to reformulate discrete feature transformations as a continuous optimization task, leveraging an encoder-evaluator-decoder structure to enhance data utilization from GRFG.

Experimental Platform To evaluate GPT-FT against baseline models, we present the results of quantitative and qualitative experiments. All experiments were conducted on an Intel Xeon Silver 4114 CPU and four NVIDIA TITAN RTX GPUs. Additional platform details are provided in Appendix 0.A.1.

4.2 Performance Evaluation

Overall Performance. This experiment evaluates GPT-FT’s ability to generate transformation sequences for identifying an optimal feature space with superior performance. Table 1 compares GPT-FT with other models on F1-score (classification) and 1-RAE (regression). GPT-FT achieves the highest score on 13 of the 15 datasets; the exceptions are Heart Disease, where TTG is marginally ahead (0.868 vs. 0.867), and Geographical Original of Music, where DIFER leads (0.632 vs. 0.508). GPT-FT’s efficient embedding space preserves feature transformation knowledge, enabling its gradient-ascent module to locate the optimal feature space effectively. Compared to MOAT, GPT-FT achieves better stability due to two factors: (1) RL-based data collection provides a solid foundation for a discriminative embedding space; (2) postfix notation reduces the search space, improving transformation knowledge acquisition. These results reflect GPT-FT’s efficacy.

Inference Time and Parameter Size. To facilitate a clear comparison of inference time and parameter size, we normalize their values to the range [0,1] using the min-max normalization approach for each dataset, with comprehensive values included in Appendix 0.A.3. Figure 2 shows GPT-FT consistently has smaller parameter sizes than MOAT across datasets, indicating greater design efficiency. For example, in the Amazon Employee dataset, GPT-FT’s size is 3.21 MB versus MOAT’s 6.46 MB (a 50% reduction), and in the Geographical Original of Music dataset, GPT-FT uses 7.08 MB compared to MOAT’s 14.14 MB. Even in smaller datasets like Heart Disease, GPT-FT (0.10 MB) remains more compact than MOAT (0.20 MB). Figure 3 compares inference times, where GPT-FT consistently outperforms MOAT. For instance, in the Ozone Level Detection dataset, GPT-FT achieves an 18% improvement (27.93s vs. 34.11s), and in the Tecator dataset, it reduces inference time by 41% (39.42s vs. 67.22s). Even in smaller datasets like Heart Disease, GPT-FT (22.61s) is faster than MOAT (24.58s). These results highlight GPT-FT’s efficiency in both parameter size and inference speed, making it a strong choice for applications where model size and inference latency matter.
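The percentages quoted above follow directly from the values in Tables 4 and 5. The short sketch below recomputes them as relative reductions with respect to MOAT; the helper name and dictionary layout are illustrative.

```python
def relative_reduction(baseline, ours):
    """Fraction by which `ours` is smaller than `baseline`."""
    return (baseline - ours) / baseline


# (MOAT, GPT-FT) pairs taken from Tables 4 and 5.
examples = {
    "Amazon Employee, parameter size (MB)": (6.46, 3.21),
    "Ozone Level Detection, inference time (s)": (34.11, 27.93),
    "Tecator, inference time (s)": (67.22, 39.42),
}

reductions = {
    name: relative_reduction(moat, gpt_ft)
    for name, (moat, gpt_ft) in examples.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
```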

Robustness Check. This experiment evaluates GPT-FT’s robustness across various downstream machine learning models. We tested Random Forest (RF), XGBoost (XGB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Ridge, LASSO, and Decision Tree (DT), with results for Weather and Wine Quality Red datasets shown in Table 2 and Table 3, using 1-RAE and F1-score metrics, respectively. GPT-FT consistently beats MOAT across models, likely due to its RL-based data collector tailoring transformation records to the downstream model. The embedding space effectively captures model-specific characteristics, enabling optimal feature space generation. These results highlight GPT-FT’s robustness.

Ablation Study. To assess the impact of Step 1 (transformation records collection) and Step 3 (gradient-ascent search) in GPT-FT, we ran two experiments. Figure 4(a) shows the results without Step 1, where the original dataset replaces the transformed feature collection. Step 1 improves performance on the Contraceptive Method Choice and Weather datasets but has little effect on the Titanic dataset, possibly because Titanic’s features are simple, whereas the additional features from Step 1 help GPT-FT capture more complex information in the other datasets. Figure 4(b) shows the results without Step 3, obtained by setting the gradient step size to 0. Without Step 3, performance drops markedly on Contraceptive Method Choice and Weather but only slightly on Titanic. The embedding space for Titanic is probably already near-optimal, so its gradients are small, whereas the larger gradients in the other datasets substantially improve GPT-FT’s performance.

Parameter Sensitivity: $\alpha$ . To validate the sensitivity of the trade-off parameter $\alpha$ in $\mathcal{L}=\alpha\mathcal{L}_{\text{pre}}+(1-\alpha)\mathcal{L}_{\text{cls}}$ (see Section 3.3), we varied $\alpha$ from 0.1 to 0.9 to observe its impact on training and performance. Lower $\alpha$ reduces the contribution of the sequence reconstruction loss $\mathcal{L}_{\text{pre}}$ while allocating more gradient to the accuracy estimation loss $\mathcal{L}_{\text{cls}}$ . Figure 5(a) shows that $\mathcal{L}_{\text{cls}}$ is highly sensitive to $\alpha$ ; lower $\alpha$ leads to faster convergence, while high $\alpha$ (e.g., 0.9) causes a training barrier, delaying or preventing convergence. Meanwhile, $\mathcal{L}_{\text{pre}}$ consistently decreases regardless of $\alpha$ , reaching a low value after 1000 epochs (Figure 5(b)). However, if training stops at that point, the target for $\mathcal{L}_{\text{cls}}$ is not met, providing poor gradients for subsequent steps. Performance-wise, $\alpha\in[0.4,0.9]$ fails to generate valid records, so we restricted $\alpha$ to [0.1, 0.3] and optimized it using NNI [32], setting $\alpha=0.133$ as the best value.

Parameter Sensitivity: Number of Embedding Generator Layers. To validate the sensitivity of the embedding generator’s layer count (see Section 3.3), we varied the number of layers from 1 to 5 and observed the training process and final performance. As shown in Figure 6(a), the differences are minimal, with a trend of faster convergence as the number of layers increases. Based on this observation, we select a single layer to minimize inference time and model size.

Parameter Sensitivity: GPT’s Embedding Size. To validate the sensitivity of GPT’s embedding size, we varied it from 32 to 1024 and observed the training process and final performance. Figure 6(b) shows that larger embedding sizes lead to faster convergence, but performance remains consistent for sizes between 64 and 1024. At 32, occasional invalid records are generated. Considering performance stability and model size, we select an embedding size of 64 for our experiments.

5 Related Work

Automated Feature Transformation (AFT) enhances feature spaces by applying mathematical operations to original features [4, 25]. Existing methods fall into three categories:

Expansion-reduction approaches [18, 24, 11, 26, 22], which expand the feature space via explicit [20] or greedy [7] transformations, then reduce it by selecting useful features. However, these approaches struggle with evaluating complex transformations, leading to subpar performance.
Evolution-evaluation approaches [40, 23, 38, 43, 47, 44], which integrate feature generation and selection in a closed-loop system optimized by evolutionary algorithms or reinforcement learning. While effective, they remain time-consuming and unstable due to reliance on discrete decision-making.
AutoML-based approaches [3, 48], inspired by AutoML’s success [8, 27, 10, 19], formulate AFT as an AutoML task. However, these methods are limited by: 1) inability to produce high-order transformations; 2) unstable performance; and 3) reliance on discrete optimization. MOAT [41] was introduced to address these deficiencies by framing AFT as a continuous optimization problem. However, MOAT utilized an LSTM model, which is considerably larger and less efficient compared to GPT. The experimental section demonstrates that GPT-FT outperforms MOAT, exhibiting a smaller parameter size and reduced inference time.

6 Conclusion

In this paper, we introduced GPT-FT, a novel framework for efficient automated feature transformation leveraging the capabilities of Generative Pre-trained Transformers (GPT) [35]. By unifying transformation sequence reconstruction and model performance estimation within a single architecture, GPT-FT achieves a significant reduction in computational overhead and parameter size compared to existing methods. Through its four-stage process (transformation records collection, embedding space construction, gradient-ascent search, and autoregressive reconstruction), GPT-FT effectively addresses the scalability and efficiency challenges inherent in automated feature transformation.

Extensive experiments on benchmark datasets demonstrate that GPT-FT matches or outperforms state-of-the-art methods on nearly all datasets, achieving strong predictive performance while reducing inference time and model size. The robustness of GPT-FT across various machine learning models highlights its adaptability and practical utility for diverse applications. Furthermore, the integration of gradient-ascent search into the embedding space exemplifies the potential of continuous optimization techniques for feature engineering tasks.

Future work will extend GPT-FT to larger datasets and more complex feature spaces, while exploring advanced transformer architectures to enhance scalability. We also aim to integrate GPT-FT with privacy-preserving machine learning, where efficient encrypted computation could enable secure feature transformation [9] in sensitive domains. Finally, adopting the evaluation benchmark [30] for sequence reconstruction and cross-domain prompt recovery will further strengthen robustness, underscoring GPT-FT’s potential to advance automated machine learning pipelines.

Appendix 0.A Experiment

0.A.1 Experiment Platform Information

All experiments were conducted on the Ubuntu 20.04.6 LTS operating system, Intel(R) Xeon(R) Silver 4114 CPU, and 4 NVIDIA TITAN RTX GPUs, with the framework of Python 3.8.5 and PyTorch 1.8.1.

0.A.2 Hyperparameter Settings

A single-layer embedding generator and a single-layer feed-forward network were employed for the text predictor and task classifier. The embedding size for all three models is 64. We utilized a single head for the self-attention block. In the training of GPT-FT, we established a batch size of 16, a learning rate of $1.31\times 10^{-5}$ , and a trade-off hyperparameter $\alpha$ set at 0.133. To infer new transformation sequences, we utilized the top 42 records as the foundational seeds.
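For convenience, the settings listed above can be gathered into a single configuration object. The sketch below is one possible layout; the class and field names are hypothetical, while the values are those reported in this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTFTConfig:
    """Hyperparameters reported in the appendix; field names are illustrative."""
    embedding_size: int = 64               # shared by all three components
    embedding_generator_layers: int = 1
    attention_heads: int = 1
    batch_size: int = 16
    learning_rate: float = 1.31e-5
    alpha: float = 0.133                   # trade-off between L_pre and L_cls
    num_seed_records: int = 42             # top records used as search seeds


config = GPTFTConfig()
```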

0.A.3 Experiment Details

Bibliography (49)

[1] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
[2] Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P., Lukasik, S.: Seeds. UCI Machine Learning Repository (2010), DOI: https://doi.org/10.24432/C5H30K
[3] Chen, X., Lin, Q., Luo, C., Li, X., Zhang, H., Xu, Y., Dang, Y., Sui, K., Zhang, X., Qiao, B., et al.: Neural feature search: A neural architecture for automated feature engineering. In: 2019 IEEE International Conference on Data Mining (ICDM). pp. 71–80. IEEE (2019)
[4] Chen, Y.W., Song, Q., Hu, X.: Techniques for automated machine learning. ACM SIGKDD Explorations Newsletter 22(2), 35–50 (2021)
[5] Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J.: Wine Quality. UCI Machine Learning Repository (2009), DOI: https://doi.org/10.24432/C56S3T
[6] Dias, D., Peres, S., Bíscaro, H.: Libras Movement. UCI Machine Learning Repository (2009), DOI: https://doi.org/10.24432/C5GC82
[7] Dor, O., Reich, Y.: Strengthening learning algorithms by feature discovery. Information Sciences 189, 176–190 (2012)
[8] Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: A survey. The Journal of Machine Learning Research 20(1), 1997–2017 (2019)