Enriching Tabular Data with Contextual LLM Embeddings: A Comprehensive Ablation Study for Ensemble Classifiers
Gjergji Kasneci, Enkelejda Kasneci

TL;DR
This paper enriches tabular data with large language model embeddings (RoBERTa and GPT-2) and, through systematic ablation studies, finds that ensemble classifiers often perform better, especially on datasets with class imbalance or with limited features and samples.
Contribution
It introduces a structured method for integrating LLM-derived features into tabular data and evaluates their impact on ensemble classifiers across multiple datasets.
Findings
LLM embeddings often rank among top features for prediction.
Embedding integration improves classifier performance on imbalanced datasets.
XGBoost and CatBoost benefit most from feature enrichment.
Abstract
Feature engineering is crucial for optimizing machine learning model performance, particularly in tabular data classification tasks. Leveraging advancements in natural language processing, this study presents a systematic approach to enrich tabular datasets with features derived from large language model embeddings. Through a comprehensive ablation study on diverse datasets, we assess the impact of RoBERTa and GPT-2 embeddings on ensemble classifiers, including Random Forest, XGBoost, and CatBoost. Results indicate that integrating embeddings with traditional numerical and categorical features often enhances predictive performance, especially on datasets with class imbalance or limited features and samples, such as UCI Adult, Heart Disease, Titanic, and Pima Indian Diabetes, with improvements particularly notable in XGBoost and CatBoost classifiers. Additionally, feature importance…
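The enrichment step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: this summary does not give the paper's serialization template, pooling strategy, or embedding dimension, so `row_to_text`, `hash_embed`, and `enrich` are hypothetical names, the sample rows are made up, and `hash_embed` is only a deterministic stand-in for a RoBERTa or GPT-2 sentence embedding.

```python
import hashlib
import math
from typing import Callable, Dict, List


def row_to_text(row: Dict[str, object]) -> str:
    """Serialize one record as 'column is value' clauses (hypothetical template)."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())


def hash_embed(text: str, dim: int = 8) -> List[float]:
    """Stand-in embedder: signed hashed bag of tokens, L2-normalised.

    This is NOT RoBERTa or GPT-2. In a real setup, replace it with a
    function that returns the pooled output of a language model.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = digest[0] % dim                      # bucket for this token
        sign = 1.0 if digest[1] % 2 == 0 else -1.0   # signed hashing
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def enrich(
    rows: List[Dict[str, object]],
    numeric_cols: List[str],
    embed_fn: Callable[[str], List[float]] = hash_embed,
) -> List[List[float]]:
    """Concatenate each row's numeric features with an embedding of its text form."""
    enriched = []
    for row in rows:
        numeric = [float(row[col]) for col in numeric_cols]
        enriched.append(numeric + embed_fn(row_to_text(row)))
    return enriched


# Made-up rows loosely shaped like a census-style dataset.
rows = [
    {"age": 39, "education": "Bachelors", "hours_per_week": 40},
    {"age": 52, "education": "HS-grad", "hours_per_week": 45},
]
features = enrich(rows, numeric_cols=["age", "hours_per_week"])
```

The resulting `features` matrix (original numeric columns followed by embedding dimensions) could then be passed to any ensemble classifier such as Random Forest, XGBoost, or CatBoost. Keeping `embed_fn` as a parameter makes it straightforward to compare runs with and without embeddings, or with different embedding models, in the spirit of an ablation.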
Taxonomy
Topics: Time Series Analysis and Forecasting · Neural Networks and Applications
Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Dense Connections · Layer Normalization · Residual Connection · Linear Warmup With Linear Decay · WordPiece · Linear Warmup With Cosine Annealing
