GPT-FT: An Efficient Automated Feature Transformation Using GPT for Sequence Reconstruction and Performance Enhancement
Yang Gao, Dongjie Wang, Scott Piersall, Ye Zhang, Liqiang Wang

TL;DR
This paper introduces GPT-FT, a transformer-based framework that automates feature transformation for machine learning, achieving high performance with reduced computational costs through a novel multi-step process.
Contribution
The paper presents a new GPT-based method for automated feature transformation that improves efficiency and scalability over existing encoder-decoder approaches.
Findings
Matches or exceeds baseline performance on benchmarks
Significantly reduces computational costs
Enhances scalability of feature transformation processes
| Dataset | Source | C/R | Samples | Features | RDG | ERG | LDA | AFAT | NFS | TTG | GRFG | DIFER | MOAT | GPT-FT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Contraceptive Method Choice [28] | UCIrvine | C | 1473 | 9 | 0.493 | 0.505 | 0.366 | 0.503 | 0.52 | 0.508 | 0.533 | 0.538 | 0.537 | 0.544 |
| Heart Disease [15] | UCIrvine | C | 303 | 13 | 0.851 | 0.850 | 0.763 | 0.834 | 0.834 | 0.868 | 0.831 | 0.841 | 0.866 | 0.867 |
| Ozone Level Detection [45] | UCIrvine | C | 2536 | 72 | 0.959 | 0.959 | 0.957 | 0.961 | 0.956 | 0.96 | 0.958 | 0.956 | 0.961 | 0.962 |
| Seeds [2] | UCIrvine | C | 210 | 7 | 0.969 | 0.971 | 0.736 | 0.971 | 0.971 | 0.965 | 0.926 | 0.957 | 0.971 | 0.977 |
| Titanic [17] | Kaggle | C | 891 | 11 | 0.814 | 0.829 | 0.736 | 0.818 | 0.82 | 0.814 | 0.82 | 0.825 | 0.831 | 0.832 |
| Lymphography [49] | UCIrvine | C | 148 | 18 | 0.108 | 0.144 | 0.167 | 0.15 | 0.152 | 0.148 | 0.182 | 0.15 | 0.267 | 0.352 |
| Amazon Employee [31] | Kaggle | C | 32769 | 9 | 0.932 | 0.934 | 0.916 | 0.93 | 0.932 | 0.933 | 0.932 | 0.929 | 0.936 | 0.983 |
| Wine Quality Red [5] | UCIrvine | C | 999 | 12 | 0.466 | 0.461 | 0.433 | 0.48 | 0.462 | 0.467 | 0.47 | 0.476 | 0.559 | 0.622 |
| Wine Quality White [5] | UCIrvine | C | 4900 | 12 | 0.524 | 0.510 | 0.449 | 0.516 | 0.525 | 0.507 | 0.534 | 0.507 | 0.536 | 0.544 |
| Tecator [37] | OpenML | R | 240 | 125 | 0.541 | 0.584 | 0.418 | 0.541 | 0.525 | 0.527 | 0.750 | 0.692 | 0.545 | 0.885 |
| Geographical Origin of Music [46] | UCIrvine | R | 1059 | 118 | 0.388 | 0.395 | 0.317 | 0.398 | 0.283 | 0.28 | 0.472 | 0.632 | 0.481 | 0.508 |
| Jasmine [36] | OpenML | R | 2984 | 145 | 0.402 | 0.415 | 0.391 | 0.411 | 0.406 | 0.407 | 0.326 | 0.447 | 0.407 | 0.477 |
| Libras move [6] | OpenML | R | 360 | 91 | 0.179 | 0.286 | 0.085 | 0.215 | 0.156 | 0.226 | 0.294 | 0.172 | 0.293 | 0.308 |
| Bodyfat [16] | UCIrvine | R | 252 | 15 | 0.84 | 0.84 | 0.282 | 0.846 | 0.848 | 0.853 | 0.652 | 0.737 | 0.843 | 0.865 |
| Weather [14] | Kaggle | R | 366 | 12 | 0.969 | 0.971 | 0.838 | 0.975 | 0.975 | 0.973 | 0.96 | 0.914 | 0.976 | 0.98 |
| Weather | RF | XGB | SVM | KNN | Ridge | LASSO | DT |
|---|---|---|---|---|---|---|---|
| RDG | 0.969 | 0.977 | 0.609 | 0.871 | 0.481 | 0.163 | 0.971 |
| ERG | 0.971 | 0.971 | 0.722 | 0.862 | 0.48 | 0.104 | 0.973 |
| LDA | 0.838 | 0.915 | 0.248 | 0.824 | 0.016 | 0.217 | 0.904 |
| AFAT | 0.975 | 0.971 | 0.629 | 0.854 | 0.474 | 0.209 | 0.976 |
| NFS | 0.975 | 0.974 | 0.614 | 0.865 | 0.202 | 0.132 | 0.976 |
| TTG | 0.975 | 0.97 | 0.571 | 0.873 | 0.197 | 0.198 | 0.978 |
| GRFG | 0.96 | 0.962 | 0.826 | 0.928 | 0.327 | 0.231 | 0.962 |
| DIFER | 0.914 | 0.906 | 0.712 | 0.9 | 0.461 | 0.217 | 0.905 |
| MOAT | 0.976 | 0.975 | 0.314 | 0.976 | 0.484 | 0.244 | 0.976 |
| GPT-FT | 0.98 | 0.98 | 0.831 | 0.978 | 0.493 | 0.251 | 0.981 |
| Wine Quality Red | RF | XGB | SVM | KNN | Ridge | LASSO | DT |
|---|---|---|---|---|---|---|---|
| RDG | 0.466 | 0.591 | 0.568 | 0.530 | 0.561 | 0.575 | 0.522 |
| ERG | 0.461 | 0.574 | 0.570 | 0.561 | 0.557 | 0.576 | 0.515 |
| LDA | 0.433 | 0.564 | 0.537 | 0.493 | 0.537 | 0.537 | 0.535 |
| AFAT | 0.480 | 0.564 | 0.356 | 0.436 | 0.522 | 0.509 | 0.490 |
| NFS | 0.462 | 0.561 | 0.559 | 0.530 | 0.573 | 0.583 | 0.468 |
| TTG | 0.467 | 0.585 | 0.560 | 0.540 | 0.560 | 0.575 | 0.532 |
| GRFG | 0.470 | 0.581 | 0.580 | 0.587 | 0.570 | 0.580 | 0.587 |
| DIFER | 0.476 | 0.576 | 0.538 | 0.538 | 0.587 | 0.587 | 0.516 |
| MOAT | 0.616 | 0.595 | 0.507 | 0.526 | 0.591 | 0.586 | 0.559 |
| GPT-FT | 0.622 | 0.596 | 0.599 | 0.587 | 0.593 | 0.598 | 0.594 |
| Dataset | Samples | Features | MOAT parameter size (MB) | GPT-FT parameter size (MB) |
|---|---|---|---|---|
| Contraceptive Method Choice | 1473 | 9 | 0.42 | 0.21 |
| Heart Disease | 303 | 13 | 0.20 | 0.10 |
| Ozone Level Detection | 2536 | 72 | 0.64 | 0.31 |
| Seeds | 210 | 7 | 0.32 | 0.16 |
| Titanic | 891 | 11 | 0.31 | 0.15 |
| Lymphography | 148 | 18 | 0.17 | 0.08 |
| Amazon Employee | 32769 | 9 | 6.46 | 3.21 |
| Wine Quality Red | 999 | 12 | 0.54 | 0.27 |
| Wine Quality White | 4900 | 12 | 1.27 | 0.63 |
| Tecator | 240 | 125 | 4.81 | 2.39 |
| Geographical Origin of Music | 1059 | 118 | 14.14 | 7.08 |
| Jasmine | 2984 | 145 | 0.98 | 0.49 |
| Libras move | 360 | 91 | 0.38 | 0.19 |
| Bodyfat | 252 | 15 | 0.23 | 0.11 |
| Weather | 366 | 12 | 0.92 | 0.45 |
| Dataset | Samples | Features | MOAT inference time (s) | GPT-FT inference time (s) |
|---|---|---|---|---|
| Contraceptive Method Choice | 1473 | 9 | 23.83 | 23.30 |
| Heart Disease | 303 | 13 | 24.58 | 22.61 |
| Ozone Level Detection | 2536 | 72 | 34.11 | 27.93 |
| Seeds | 210 | 7 | 36.89 | 29.08 |
| Titanic | 891 | 11 | 25.59 | 23.56 |
| Lymphography | 148 | 18 | 27.21 | 24.44 |
| Amazon Employee | 32769 | 9 | 32.19 | 23.43 |
| Wine Quality Red | 999 | 12 | 27.35 | 23.94 |
| Wine Quality White | 4900 | 12 | 25.22 | 23.54 |
| Tecator | 240 | 125 | 67.22 | 39.42 |
| Geographical Origin of Music | 1059 | 118 | 71.10 | 41.37 |
| Jasmine | 2984 | 145 | 57.57 | 43.02 |
| Libras move | 360 | 91 | 31.83 | 29.14 |
| Bodyfat | 252 | 15 | 34.67 | 28.35 |
| Weather | 366 | 12 | 24.12 | 22.79 |
Yang Gao (1), Dongjie Wang (2), Scott Piersall (1), Ye Zhang (3), Liqiang Wang (1)
(1) University of Central Florida, Orlando FL 32816, USA. Email: {yang.gao,sc382961,liqiang.wang}@ucf.edu
(2) University of Kansas, Lawrence KS 66045, USA. Email: [email protected]
(3) Northeast Normal University, Changchun Jilin, PRC. Email: [email protected]
Abstract
Feature transformation plays a critical role in enhancing machine learning model performance by optimizing data representations. Recent state-of-the-art approaches address this task as a continuous embedding optimization problem, converting discrete search into a learnable process. Although effective, these methods often rely on sequential encoder-decoder structures that cause high computational costs and parameter requirements, limiting scalability and efficiency. To address these limitations, we propose a novel framework that accomplishes automated feature transformation through four steps: transformation records collection, embedding space construction with a revised Generative Pre-trained Transformer (GPT) model, gradient-ascent search, and autoregressive reconstruction. In our approach, the revised GPT model serves two primary functions: (a) feature transformation sequence reconstruction; and (b) model performance estimation and enhancement for downstream tasks by constructing the embedding space. Such a multi-objective optimization framework reduces parameter size and accelerates transformation processes. Experimental results on benchmark datasets show that the proposed framework matches or exceeds baseline performance, with significant gains in computational efficiency. This work highlights the potential of transformer-based architectures for scalable, high-performance automated feature transformation.
Keywords: Automated Feature Transformation · Generative Pre-Trained Transformer · Multi-Objective Optimization
1 Introduction
Feature transformation is a pivotal component in machine learning pipelines, aiming to enhance the performance of downstream models by optimizing data representations. An effective feature transformation can significantly impact the predictive accuracy of models, especially in scenarios involving complex or high-dimensional datasets. Traditional approaches often rely on manual feature engineering, which is time-consuming and requires substantial domain expertise. This has spurred interest in automated feature transformation (AFT) methods that can systematically and efficiently explore the feature space.
The current algorithms for AFT can be broadly classified into three distinct categories: (1) Expansion-Reduction Methodologies: These approaches, such as Deep Feature Synthesis [18], AutoFeat [12], and Cognito [24], apply various mathematical operations across all features to generate a large set of potential transformed features, followed by a reduction phase to select the most valuable ones. Although these methods capture complex feature interactions, they often rely on random generation, leading to computational inefficiency and instability because of the inclusion of many redundant features. (2) Iterative-Feedback Approaches: Methods like Group Feature Generation [40], Feature Engineering Automation [33, 42], and Genetic Programming-based techniques [39] combine feature generation and selection in iterative cycles, updating strategies based on model performance feedback. While guided by evolutionary algorithms or reinforcement learning, their reliance on discrete search spaces can hinder convergence, making scaling to larger feature spaces challenging and inefficient. (3) Neural Architecture Search (NAS)-Based Approaches: Inspired by NAS, which was originally designed to automate neural network architecture design, some studies have framed AFT as a problem within the NAS paradigm [3, 48, 41, 29]. These methods treat feature transformation sequences as hyperparameters within a model structure for optimization. Although this structured formulation guides the search, it often suffers from slow speeds and large parameter sizes, limiting efficiency and scalability.
To address these limitations, we introduce a novel Generative Pre-trained Transformer framework for efficient Automated Feature Transformation (GPT-FT). Transformers provide strong sequence modeling and generation capabilities, enabling parallelization and improved parameter efficiency over traditional methods. Our framework includes four key steps: (1) Transformation Record Collection: We gather a dataset comprising feature transformation sequences and their corresponding model performance metrics. This dataset serves as the foundation for learning the relationship between transformation sequences and their impact on model performance. (2) Embedding Space Construction with revised GPT: We adopt the architecture of the GPT-1 model [35] and train it from scratch to regenerate transformation sequences. Notably, our model, GPT-FT, is significantly smaller than GPT-1 in terms of parameter size, with an embedding size of 64 compared to GPT-1’s 768. This step serves two purposes: (a) feature transformation sequence reconstruction, which learns to generate valid and effective transformation sequences in an autoregressive manner; and (b) model performance estimation and optimization, which predicts the performance impact of given transformation sequences to guide the optimization process. (3) Gradient-Ascent Search: We perform optimization in the continuous embedding space constructed by our GPT-FT model. By applying gradient-ascent techniques, we efficiently search for embeddings that are likely to yield improved model performance. (4) Autoregressive Reconstruction: The optimized embeddings are decoded back into feature transformation sequences using GPT-FT’s autoregressive capabilities. This results in refined feature spaces tailored for enhanced downstream model performance.
By integrating sequence reconstruction and performance estimation/enhancement tasks within our decoder-only GPT-FT model, our approach significantly reduces parameter size and computational overhead compared to traditional encoder-decoder methods. This streamlined, decoder-only structure minimizes the parameter requirements, enhancing scalability and making the framework suitable for large-scale and real-time applications. We evaluate our framework on benchmark datasets, where it matches or surpasses state-of-the-art methods and achieves significant computational efficiency, highlighting the advantages of transformer-based architectures for automated feature transformation.
Our contributions can be summarized as follows:
- We introduce a novel framework, GPT-FT, that leverages the GPT model architecture for efficient automated feature transformation, addressing the scalability and efficiency challenges present in existing methods.
- We show the dual capability of the GPT-FT model in reconstructing transformation sequences and estimating model performance within a unified architecture, enabling effective optimization in a continuous embedding space.
- We show through extensive experiments that our framework achieves superior performance with reduced computational costs compared to state-of-the-art methods.
2 Problem Statement
Our objective is to provide a robust, fully differentiable system for automatic feature transformation. Given a dataset $D = \{X, y\}$ and an operation set $\mathcal{O}$, we use a cascading reinforcement learning framework to collect training data $R = \{(\Gamma_i, v_i)\}_{i=1}^{n}$, where $\Gamma_i$ denotes a sequence of feature transformations and $v_i$ indicates its predictive performance. Our framework jointly optimizes a mapping function $\phi$, a reconstruction function $\psi$, and an evaluation function $\omega$ to embed transformation sequences into a continuous space, linking each point with its corresponding sequence and performance metric. Through gradient-based search in the embedding space, we determine the best transformation sequence $\Gamma^*$, which may be expressed as follows:
$$\Gamma^* = \psi(E^*) = \arg\max_{E \in \mathcal{E}} \mathcal{A}\big(\mathcal{M}(X[\psi(E)]),\, y\big)$$
where $\Gamma$ transforms the original dataset features $X$ into $X[\Gamma]$, $\phi$ maps $\Gamma$ to a continuous embedding space $\mathcal{E}$, and $\psi$ reconstructs a sequence of feature transformations from any embedding point; $E^*$ denotes the optimal embedding; $\mathcal{M}$ represents the downstream machine learning model; and $\mathcal{A}$ indicates the performance metric. Ultimately, we employ $\Gamma^*$ to convert $X$ into the optimal feature space $X^* = X[\Gamma^*]$, thereby maximizing $\mathcal{A}$.
3 Methodology
3.1 Framework Overview
Figure 1 shows the framework of GPT-FT with four steps:
(1) Transformation Records Collection.
(2) Embedding Space Construction with a revised GPT.
(3) Gradient-Ascent Search.
(4) Autoregressive Reconstruction.
In Step 1, we collect records of feature transformation sequences and their associated model performance using an RL-based framework, as described in [40]. In Step 2, our GPT-FT model encodes knowledge from the collected feature transformation records into a continuous embedding space. To achieve this, we minimize both the feature transformation sequence reconstruction loss and the model performance estimation loss. In Step 3, we initially acquire the embeddings of the highest-ranking transformation operation sequences through the well-trained GPT-FT. Using these embeddings as initial points, we explore the gradient generated by the GPT-FT to identify optimal embeddings that enhance model performance. In Step 4, the GPT-FT based Text Predictor decodes optimal embeddings to generate candidate feature transformation sequences. These sequences are applied to the original features to construct refined feature spaces. A downstream predictive model evaluates the quality of these spaces, and the feature space with the highest performance is selected as the optimal output.
3.2 Transformation Record Collection
To automatically collect a large volume of high-quality transformation records, we employ an RL-based feature transformation framework as the data collector [40]. Specifically, the feature transformation process is modeled as three Markov Decision Processes (MDPs), each handled by one agent: a head feature agent, an operation agent, and a tail feature agent. These agents work collaboratively to select candidate features and mathematical operations for generating new features. The process is optimized to maximize downstream predictive performance while minimizing feature space redundancy. During this learning phase, transformation sequences and their corresponding model performance are collected to prepare the data $R = \{(\Gamma_i, v_i)\}_{i=1}^{n}$, where $\Gamma_i$ is a transformed feature sequence, $v_i$ is the corresponding downstream task performance, and $n$ is the number of pairs.
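The shape of the collected records can be illustrated with a minimal sketch. This is not the RL collector of [40]: a uniformly random policy stands in for the three agents, and the names `TransformationRecord`, `cross`, `collect_records`, the operation list, and the `evaluate` callback are all hypothetical. The sketch only shows how head-operation-tail choices could be stored as postfix tokens paired with a performance score.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TransformationRecord:
    tokens: List[str]    # postfix tokens; features are separated by "<SEP>"
    performance: float   # downstream score, e.g. F1-score or 1-RAE

BINARY_OPS = ["+", "-", "*", "/"]  # assumed operation set, for illustration only

def cross(head: List[str], op: str, tail: List[str]) -> List[str]:
    """Combine two postfix expressions with a binary operation: head tail op."""
    return head + tail + [op]

def collect_records(feature_names: List[str],
                    evaluate: Callable[[List[List[str]]], float],
                    n_records: int, steps: int = 3, seed: int = 0
                    ) -> List[TransformationRecord]:
    """Generate records with a random policy (a stand-in for the RL agents)."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_records):
        # Every original feature starts as a one-token postfix expression.
        features = [[name] for name in feature_names]
        for _ in range(steps):
            head = rng.choice(features)      # "head feature agent"
            op = rng.choice(BINARY_OPS)      # "operation agent"
            tail = rng.choice(features)      # "tail feature agent"
            features.append(cross(head, op, tail))
        tokens: List[str] = []
        for i, expr in enumerate(features):
            if i:
                tokens.append("<SEP>")
            tokens.extend(expr)
        # `evaluate` is a hypothetical hook that scores the feature set
        # with a downstream model.
        records.append(TransformationRecord(tokens, evaluate(features)))
    return records
```

In a real collector the three choices would come from learned policies rewarded by downstream performance and redundancy, rather than from `rng.choice`.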
3.3 Embedding Space Construction with GPT
We use GPT-FT to map the sequential information of preprocessed features into an embedding space. Each feature is represented as a pair of a transformation operation sequence and its corresponding model performance. GPT-FT produces two outputs, text-based predictions ($\hat{\Gamma}$) and downstream task performance ($\hat{v}$), leading to two distinct training objectives.
Target 1: Learning Continuous Embeddings. The first objective is to train GPT-FT to generate continuous embeddings that effectively represent the original dataset while reducing the search space. These continuous embeddings can be explored using gradient-based optimization. To achieve this, GPT-FT uses a single-layer Embedding Generator $\phi$ (reduced from the original 12 layers in GPT [35]). We train the Embedding Generator $\phi$ alongside the Text Predictor $\psi$, both using the same input-output pairs collected in Step 1. The forward process is expressed as $\hat{\Gamma} = \psi(\phi(\Gamma))$, where $\Gamma$ and $\hat{\Gamma}$ are token sequences of length $T$. For the loss function, we assume the GPT-FT output follows a probability distribution centered on the input sequence with unit variance. Accordingly, we employ the Negative Log-Likelihood (NLL) loss $\mathcal{L}_{rec}$, computed between the GPT-FT’s text-based output $\hat{\Gamma}$ and the input sequence $\Gamma$ over the $n$ feature-performance pairs defined in Section 3.2.
Target 2: Estimating Downstream Task Performance. The second objective is to train GPT-FT to estimate the downstream task performance, enabling gradient-based guidance for subsequent search steps. Here, the ground truth is the model performance $v$ (e.g., F1-score or 1-RAE) from Step 1. We train the single-layer Embedding Generator $\phi$ alongside the Task Classifier $\omega$ in GPT-FT to predict performance values, formulated as $\hat{v} = \omega(\phi(\Gamma))$. The loss function for this objective, $\mathcal{L}_{est}$, is computed between $\hat{v}$, the Task Classifier’s predicted performance, and $v$, the actual performance from Step 1. Both the Text Predictor and the Task Classifier are implemented as single-layer linear transformations.
Joint Training Loss: We jointly optimize the GPT-FT model. The joint training loss can be formulated as $\mathcal{L} = \alpha\,\mathcal{L}_{rec} + (1 - \alpha)\,\mathcal{L}_{est}$, where $\alpha$ is the trade-off hyperparameter that controls the contribution of the sequence reconstruction loss and the accuracy estimation loss.
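The two objectives and their weighted combination can be sketched numerically as follows. The concrete loss forms are assumptions, since the exact formulas are not recoverable here: a token-level categorical NLL stands in for the reconstruction loss and a mean squared error for the estimation loss. The function names are hypothetical; only the weighting by the trade-off hyperparameter follows the text (a lower value shifts weight from reconstruction to estimation).

```python
import math
from typing import Sequence

def sequence_nll(step_probs: Sequence[Sequence[float]],
                 target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target tokens.

    step_probs[t][j] is the predicted probability of token j at step t.
    (Assumed stand-in for the reconstruction loss.)
    """
    total = 0.0
    for probs, target in zip(step_probs, target_ids):
        total -= math.log(probs[target])
    return total / len(target_ids)

def mse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean squared error between predicted and actual performance.

    (Assumed stand-in for the estimation loss.)
    """
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def joint_loss(alpha: float, rec_loss: float, est_loss: float) -> float:
    """Weighted sum: alpha * reconstruction + (1 - alpha) * estimation."""
    return alpha * rec_loss + (1.0 - alpha) * est_loss

# Toy values: two decoding steps over a three-token vocabulary.
rec = sequence_nll([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [0, 1])
est = mse([0.80], [0.83])
loss = joint_loss(0.133, rec, est)  # 0.133 is the value reported in the appendix
```

With the trade-off at 0.133, roughly 87% of the weight falls on the estimation term, which matches the observation that a small trade-off value speeds up convergence of the performance estimator.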
3.4 Gradient-Ascent Search
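To perform optimal embedding search, we first select the top-$k$ transformation sequences ranked by downstream predictive accuracy. The trained GPT-FT maps these postfix expressions to continuous embeddings, which serve as initial points for gradient ascent. Starting from an embedding $E$, the search updates as $\tilde{E} = E + \eta \frac{\partial G}{\partial E}$, where $\tilde{E}$ is the refined embedding, $\eta$ is the step size, and $G$ represents GPT-FT, whose performance estimate is differentiated with respect to $E$. The estimated performance satisfies $\tilde{v} \geq v$, where $v$ and $\tilde{v}$ are the estimates at $E$ and $\tilde{E}$. For $k$ seeds, the refined embeddings are $\{\tilde{E}_i\}_{i=1}^{k}$.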
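A minimal sketch of the search loop, under loud assumptions: the trained model is replaced by a toy concave estimator with a hand-written gradient (`toy_estimator`, `toy_grad`, and `CENTER` are hypothetical), whereas a real implementation would obtain the gradient of the predicted performance by automatic differentiation. Only the update rule (embedding plus step size times gradient) and the top-k seeding follow the text.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]

def gradient_ascent(embedding: Sequence[float],
                    grad_fn: Callable[[Vector], Vector],
                    step_size: float, n_steps: int) -> Vector:
    """Repeatedly apply E <- E + step_size * dG/dE."""
    e = list(embedding)
    for _ in range(n_steps):
        g = grad_fn(e)
        e = [ei + step_size * gi for ei, gi in zip(e, g)]
    return e

def search_from_seeds(records: Sequence[Tuple[Vector, float]],
                      grad_fn: Callable[[Vector], Vector],
                      k: int, step_size: float, n_steps: int) -> List[Vector]:
    """Take the k best-performing (embedding, performance) pairs as seeds
    and refine each seed independently."""
    top = sorted(records, key=lambda r: r[1], reverse=True)[:k]
    return [gradient_ascent(e, grad_fn, step_size, n_steps) for e, _ in top]

# Toy stand-in for the performance estimator: a concave bowl peaking at CENTER.
CENTER = [0.5, -0.2]

def toy_estimator(e: Vector) -> float:
    return 1.0 - sum((ei - ci) ** 2 for ei, ci in zip(e, CENTER))

def toy_grad(e: Vector) -> Vector:
    return [-2.0 * (ei - ci) for ei, ci in zip(e, CENTER)]

seeds = [([0.0, 0.0], 0.61), ([1.0, 1.0], 0.42), ([0.4, 0.0], 0.75)]
refined = search_from_seeds(seeds, toy_grad, k=2, step_size=0.1, n_steps=50)
```

On this toy surface every refined embedding moves toward the peak, so its estimated performance is at least that of its seed; on a learned estimator the same guarantee only holds for sufficiently small step sizes.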
3.5 Autoregressive Reconstruction
The trained Text Predictor $\psi$ in GPT-FT reconstructs transformation sequences from the candidate embeddings as $\hat{\Gamma}_i = \psi(\tilde{E}_i)$. At each decoding step the token with the highest probability is selected, generating transformation sequences $\{\hat{\Gamma}_i\}_{i=1}^{k}$. Each sequence is segmented by the <SEP> token, with invalid segments removed based on mathematical computability. The valid segments form the feature transformation sequences $\{\Gamma'_i\}_{i=1}^{k}$, which are applied to $X$ to produce the refined feature spaces $\{X[\Gamma'_i]\}_{i=1}^{k}$. The feature set yielding the highest downstream performance is selected as the optimal feature space $X^*$.
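The segmentation and validity filtering can be sketched as below. The operator set (`+`, `-`, `*`, `/`, `sqrt`, `log`) and the function names are assumptions for illustration; the sketch treats a segment as valid when it is a well-formed postfix expression that evaluates without error on every data row.

```python
import math
from typing import Dict, List, Optional

BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}
UNARY = {"sqrt": math.sqrt, "log": math.log}

def split_segments(tokens: List[str], sep: str = "<SEP>") -> List[List[str]]:
    """Split a decoded token sequence into per-feature postfix segments."""
    segments, current = [], []
    for tok in tokens:
        if tok == sep:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments

def eval_postfix(segment: List[str], row: Dict[str, float]) -> Optional[float]:
    """Evaluate one postfix segment on one row; return None if not computable."""
    stack: List[float] = []
    try:
        for tok in segment:
            if tok in BINARY:
                b = stack.pop()
                a = stack.pop()
                stack.append(BINARY[tok](a, b))
            elif tok in UNARY:
                stack.append(UNARY[tok](stack.pop()))
            else:
                stack.append(row[tok])  # token is a feature name
    except (IndexError, KeyError, ValueError, ZeroDivisionError):
        return None
    # A well-formed expression leaves exactly one value on the stack.
    return stack[0] if len(stack) == 1 else None

def valid_segments(tokens: List[str],
                   rows: List[Dict[str, float]]) -> List[List[str]]:
    """Keep only segments that are computable on every row."""
    return [seg for seg in split_segments(tokens)
            if all(eval_postfix(seg, row) is not None for row in rows)]

decoded = ["f1", "f2", "+", "<SEP>", "f1", "+", "<SEP>", "f2", "sqrt"]
rows = [{"f1": 1.0, "f2": 4.0}]
kept = valid_segments(decoded, rows)  # the malformed middle segment is dropped
```

Each surviving segment defines one new feature column; scoring the resulting feature sets with the downstream model and keeping the best one completes the step.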
4 Experiment
4.1 Experimental Setup
Datasets and Evaluation Metrics We conducted experiments on 15 publicly available datasets from Kaggle [13], OpenML [34], and UCI [21], comprising nine classification and six regression tasks. Dataset statistics are summarized in Table 1. For classification tasks, we used F1-score, Precision, Recall, and ROC/AUC, while regression tasks were evaluated using 1-Relative Absolute Error (1-RAE) [40], 1-Mean Absolute Error (1-MAE), 1-Mean Square Error (1-MSE), and 1-Root Mean Square Error (1-RMSE).
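For reference, the regression metric 1-RAE can be computed as in the sketch below, which assumes the standard definition of relative absolute error (total absolute error of the predictions divided by the total absolute error of always predicting the mean). The function name is hypothetical.

```python
from typing import Sequence

def one_minus_rae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - RAE, where RAE = sum|y - y_hat| / sum|y - mean(y)|.

    Higher is better: 1.0 is a perfect fit, 0.0 matches the mean predictor.
    Undefined (division by zero) when all targets are identical.
    """
    mean_y = sum(y_true) / len(y_true)
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    baseline = sum(abs(t - mean_y) for t in y_true)
    return 1.0 - abs_err / baseline

score = one_minus_rae([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
```

Because the score is relative to the mean predictor, it can be negative for models that do worse than that baseline.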
Baseline Models We compared our method against nine prevalent feature generation techniques: (1) RDG generates transformation records of feature-operation-feature randomly to create a new feature space. (2) ERG applies operations to each feature to expand the feature space, then selects essential features as the new feature set. (3) LDA [1] employs matrix factorization to derive hidden states as the generated feature space. (4) AFAT [12] improves upon ERG by iteratively generating new features and using multi-step feature selection to identify informative ones. (5) NFS [3] models transformation sequences for each feature and optimizes feature generation using reinforcement learning. (6) TTG [23] conceptualizes the transformation process as a graph and applies reinforcement learning to search for the optimal feature set. (7) GRFG [40] uses three collaborative reinforced agents for feature generation and introduces feature grouping to improve learning efficiency. (8) DIFER [48] employs a seq2seq model to embed randomly generated feature transformations and applies gradient search to identify optimal features. (9) MOAT [41] uses an embedding-optimization-reconstruction framework to reformulate discrete feature transformations as a continuous optimization task, leveraging an encoder-evaluator-decoder structure to enhance data utilization from GRFG.
Experimental Platform To evaluate GPT-FT against baseline models, we present the results of quantitative and qualitative experiments. All experiments were conducted on an Intel Xeon Silver 4114 CPU and four NVIDIA TITAN RTX GPUs. Additional platform details are provided in Appendix 0.A.1.
4.2 Performance Evaluation
Overall Performance. This experiment evaluates GPT-FT’s ability to generate transformation sequences that identify a feature space with superior performance. Table 1 compares GPT-FT with other models on F1-score (classification) and 1-RAE (regression). GPT-FT achieves the best score on 13 of the 15 datasets; the exceptions are Heart Disease, where TTG is marginally higher (0.868 vs. 0.867), and Geographical Origin of Music, where DIFER leads (0.632 vs. 0.508). GPT-FT’s efficient embedding space preserves feature transformation knowledge, enabling its gradient-ascent module to locate the optimal feature space effectively. Compared to MOAT, GPT-FT achieves better stability due to: 1) RL-based data collection, which provides a solid foundation for a discriminative embedding space; and 2) postfix notation, which reduces the search space and improves transformation knowledge acquisition.
Inference Time and Parameter Size. To facilitate a clear comparison of inference time and parameter size, we normalize their values to the range [0,1] using the min-max normalization approach for each dataset, with comprehensive values included in Appendix 0.A.3. Figure 2 shows GPT-FT consistently has smaller parameter sizes than MOAT across datasets, indicating greater design efficiency. For example, in the Amazon Employee dataset, GPT-FT’s size is 3.21 MB versus MOAT’s 6.46 MB (a 50% reduction), and in the Geographical Origin of Music dataset, GPT-FT uses 7.08 MB compared to MOAT’s 14.14 MB. Even in smaller datasets like Heart Disease, GPT-FT (0.10 MB) remains more compact than MOAT (0.20 MB). Figure 3 compares inference times, where GPT-FT consistently outperforms MOAT. For instance, in the Ozone Level Detection dataset, GPT-FT achieves an 18% improvement (27.93s vs. 34.11s), and in the Tecator dataset, it reduces inference time by 41% (39.42s vs. 67.22s). Even in smaller datasets like Heart Disease, GPT-FT (22.61s) is faster than MOAT (24.58s). These results highlight GPT-FT’s efficiency in both parameter size and inference speed, making it a strong choice for applications requiring optimized performance.
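The percentages quoted in this paragraph can be checked directly from the raw values in the appendix tables. The helper name `relative_reduction` is hypothetical; the numbers are the ones reported for MOAT and GPT-FT.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline

# (MOAT, GPT-FT) pairs taken from the reported tables.
size_amazon_mb = (6.46, 3.21)   # parameter size, Amazon Employee
time_ozone_s = (34.11, 27.93)   # inference time, Ozone Level Detection
time_tecator_s = (67.22, 39.42) # inference time, Tecator

for name, (moat, gpt_ft) in [("Amazon size", size_amazon_mb),
                             ("Ozone time", time_ozone_s),
                             ("Tecator time", time_tecator_s)]:
    print(f"{name}: {relative_reduction(moat, gpt_ft):.1%} reduction")
```

The three reductions round to 50%, 18%, and 41%, in agreement with the figures stated in the text.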
Robustness Check. This experiment evaluates GPT-FT’s robustness across various downstream machine learning models. We tested Random Forest (RF), XGBoost (XGB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Ridge, LASSO, and Decision Tree (DT), with results for Weather and Wine Quality Red datasets shown in Table 2 and Table 3, using 1-RAE and F1-score metrics, respectively. GPT-FT consistently beats MOAT across models, likely due to its RL-based data collector tailoring transformation records to the downstream model. The embedding space effectively captures model-specific characteristics, enabling optimal feature space generation. These results highlight GPT-FT’s robustness.
Ablation Study. To assess the impact of Step 1 (transformation records collection) and Step 3 (gradient-ascent search) in GPT-FT, we ran two experiments. Figure 4(a) shows the results without Step 1, where the original dataset replaces the collected transformation records. Step 1 improves performance on the Contraceptive Method Choice and Weather datasets but has little effect on Titanic, possibly because Titanic’s features are simple, whereas the additional features from Step 1 help GPT-FT learn more complex information in the other datasets. Figure 4(b) shows the results without Step 3, with the gradient step size set to 0. Without Step 3, performance drops markedly on Contraceptive Method Choice and Weather but only slightly on Titanic. The embedding space for Titanic is probably near-optimal, with small gradients, whereas larger gradients in the other datasets substantially improve GPT-FT’s performance.
Parameter Sensitivity: trade-off parameter $\alpha$. To validate the sensitivity of the trade-off parameter $\alpha$ in the joint training loss (see Section 3.3), we varied $\alpha$ from 0.1 to 0.9 to observe its impact on training and performance. A lower $\alpha$ reduces the contribution of the sequence reconstruction loss $\mathcal{L}_{rec}$ while allocating more gradient to the accuracy estimation loss $\mathcal{L}_{est}$. Figure 5(a) shows that $\mathcal{L}_{est}$ is highly sensitive to $\alpha$: a lower $\alpha$ leads to faster convergence, while a high $\alpha$ (e.g., 0.9) causes a training barrier, delaying or preventing convergence. Meanwhile, $\mathcal{L}_{rec}$ consistently decreases regardless of $\alpha$, reaching a low value after 1000 epochs (Figure 5(b)). However, if training stops there, the target for $\mathcal{L}_{est}$ is unfulfilled, providing poor gradients for subsequent steps. Performance-wise, a large $\alpha$ fails to generate valid records, so we restricted $\alpha$ to [0.1, 0.3] and optimized it using NNI [32], setting $\alpha = 0.133$ (see Appendix 0.A.2) as the best value.
Parameter Sensitivity: number of Embedding Generator layers. To validate the sensitivity of the Embedding Generator’s layer count (see Section 3.3), we varied the number of layers from 1 to 5 and observed the training process and final performance. As shown in Figure 6(a), the differences are minimal, with a trend of faster convergence as the number of layers increases. Based on this observation, we select a single layer to minimize inference time and model size.
Parameter Sensitivity: GPT’s embedding size. To validate the sensitivity of GPT’s embedding size, we varied it from 32 to 1024 and observed the training process and final performance. Figure 6(b) shows that larger embedding sizes lead to faster convergence, but performance remains consistent for sizes between 64 and 1024. At 32, occasional invalid records are generated. Considering performance stability and model size, we select an embedding size of 64 for our experiments.
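To give a rough sense of why the embedding size dominates model size, the sketch below counts the weights of one transformer block as a function of the embedding size. It assumes a GPT-style block layout (four attention projections with biases, a feed-forward network with a 4x hidden width, two layer norms); GPT-FT's exact layout is not specified here, and token embeddings, positional embeddings, and output heads are excluded. The function name is hypothetical.

```python
def transformer_block_params(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate parameter count of one GPT-style transformer block."""
    d = d_model
    # Query, key, value, and output projections, each d x d plus a bias.
    attention = 4 * d * d + 4 * d
    # Two linear layers: d -> ffn_mult*d and ffn_mult*d -> d, with biases.
    feed_forward = 2 * ffn_mult * d * d + ffn_mult * d + d
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d
    return attention + feed_forward + layer_norms

small = transformer_block_params(64)    # embedding size used by GPT-FT
large = transformer_block_params(768)   # embedding size of GPT-1
ratio = large / small
```

Under these assumptions the per-block count grows roughly quadratically with the embedding size, so a block at size 64 is about two orders of magnitude smaller than one at size 768, before accounting for the reduction from 12 layers to one.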
5 Related Work
Automated Feature Transformation (AFT) enhances feature spaces by applying mathematical operations to original features [4, 25]. Existing methods fall into three categories:
- Expansion-reduction approaches [18, 24, 11, 26, 22], which expand the feature space via explicit [20] or greedy [7] transformations, then reduce it by selecting useful features. However, these approaches struggle with evaluating complex transformations, leading to subpar performance.
- Evolution-evaluation approaches [40, 23, 38, 43, 47, 44], which integrate feature generation and selection in a closed-loop system optimized by evolutionary algorithms or reinforcement learning. While effective, they remain time-consuming and unstable due to reliance on discrete decision-making.
- AutoML-based approaches [3, 48], inspired by AutoML’s success [8, 27, 10, 19], formulate AFT as an AutoML task. However, these methods are limited by: 1) inability to produce high-order transformations; 2) unstable performance; and 3) reliance on discrete optimization. MOAT [41] was introduced to address these deficiencies by framing AFT as a continuous optimization problem. However, MOAT utilized an LSTM model, which is considerably larger and less efficient compared to GPT. The experimental section demonstrates that GPT-FT outperforms MOAT, exhibiting a smaller parameter size and reduced inference time.
6 Conclusion
In this paper, we introduced GPT-FT, a novel framework for efficient automated feature transformation leveraging the capabilities of Generative Pre-trained Transformers (GPT) [35]. By unifying transformation sequence reconstruction and model performance estimation within a single architecture, GPT-FT achieves a significant reduction in computational overhead and parameter size compared to existing methods. Through its four-stage process (transformation records collection, embedding space construction, gradient-ascent search, and autoregressive reconstruction), GPT-FT effectively addresses the scalability and efficiency challenges inherent in automated feature transformation.
Extensive experiments on benchmark datasets demonstrate that GPT-FT matches or outperforms state-of-the-art methods on most benchmarks, achieving strong predictive performance while reducing inference time and model size. The robustness of GPT-FT across various machine learning models highlights its adaptability and practical utility for diverse applications. Furthermore, the integration of gradient-ascent search into the embedding space exemplifies the potential of continuous optimization techniques for feature engineering tasks.
Future work will extend GPT-FT to larger datasets and more complex feature spaces, while exploring advanced transformer architectures to enhance scalability. We also aim to integrate GPT-FT with privacy-preserving machine learning, where efficient encrypted computation could enable secure feature transformation [9] in sensitive domains. Finally, adopting the evaluation benchmark [30] for sequence reconstruction and cross-domain prompt recovery will further strengthen robustness, underscoring GPT-FT’s potential to advance automated machine learning pipelines.
Appendix 0.A Experiment
0.A.1 Experiment Platform Information
All experiments were conducted on the Ubuntu 20.04.6 LTS operating system, Intel(R) Xeon(R) Silver 4114 CPU, and 4 NVIDIA TITAN RTX GPUs, with the framework of Python 3.8.5 and PyTorch 1.8.1.
0.A.2 Hyperparameter Settings
A single-layer embedding generator and a single-layer feed-forward network were employed for the text predictor and task classifier. The embedding size for all three models is 64. We utilized a single head for the self-attention block. In the training of GPT-FT, we established a batch size of 16, a learning rate of , and a trade-off hyperparameter set at 0.133. To infer new transformation sequences, we utilized the top 42 records as the foundational seeds.
0.A.3 Experiment Details
References
- [1] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
- [2] Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P., Lukasik, S.: Seeds. UCI Machine Learning Repository (2010), https://doi.org/10.24432/C5H30K
- [3] Chen, X., Lin, Q., Luo, C., Li, X., Zhang, H., Xu, Y., Dang, Y., Sui, K., Zhang, X., Qiao, B., et al.: Neural feature search: A neural architecture for automated feature engineering. In: 2019 IEEE International Conference on Data Mining (ICDM). pp. 71–80. IEEE (2019)
- [4] Chen, Y.W., Song, Q., Hu, X.: Techniques for automated machine learning. ACM SIGKDD Explorations Newsletter 22(2), 35–50 (2021)
- [5] Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J.: Wine Quality. UCI Machine Learning Repository (2009), https://doi.org/10.24432/C56S3T
- [6] Dias, D., Peres, S., Bscaro, H.: Libras Movement. UCI Machine Learning Repository (2009), https://doi.org/10.24432/C5GC82
- [7] Dor, O., Reich, Y.: Strengthening learning algorithms by feature discovery. Information Sciences 189, 176–190 (2012)
- [8] Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: A survey. The Journal of Machine Learning Research 20(1), 1997–2017 (2019)
