Embedding World Knowledge into Tabular Models: Towards Best Practices for Embedding Pipeline Design
Oksana Kolomenko, Ricardo Knauer, Erik Rodner

TL;DR
This paper systematically benchmarks 256 LLM-based embedding pipelines for tabular prediction and identifies the design factors that most influence predictive performance, notably concatenating embeddings with the original columns rather than replacing them, and embedding model size.
Contribution
It provides the first comprehensive benchmark of embedding pipeline configurations for tabular prediction, offering practical guidelines for effective design.
Findings
Concatenating embeddings generally outperforms replacing original columns.
Larger embedding models tend to improve performance.
Public leaderboard rankings are poor indicators of embedding quality.
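To make the concatenate-versus-replace distinction concrete, here is a minimal sketch. It rests on assumptions that are not taken from the paper: each row is serialized to a short string and embedded as a whole, `toy_embed` is a hypothetical stand-in for a real LLM embedding model, and `build_features` is an illustrative helper name. The paper's actual preprocessing strategies may differ.

```python
import numpy as np

def toy_embed(texts, dim=4):
    # Hypothetical stand-in for an LLM embedding model (assumption):
    # returns one deterministic pseudo-random vector per input string.
    rows = []
    for text in texts:
        seed = sum(ord(ch) for ch in text)  # same string -> same vector
        rows.append(np.random.default_rng(seed).standard_normal(dim))
    return np.vstack(rows)

def build_features(X, serialized_rows, mode, dim=4):
    # Illustrative helper (not from the paper): combine original columns X
    # with row embeddings E according to the chosen strategy.
    E = toy_embed(serialized_rows, dim)
    if mode == "concat":
        return np.hstack([X, E])  # keep original columns, append embeddings
    if mode == "replace":
        return E                  # embeddings stand in for the original columns
    raise ValueError(f"unknown mode: {mode!r}")

# Toy table with two numeric columns and one serialized string per row.
X = np.array([[34.0, 1.0], [51.0, 0.0], [27.0, 1.0]])
serialized_rows = [
    "age is 34, smoker is yes",
    "age is 51, smoker is no",
    "age is 27, smoker is yes",
]

X_concat = build_features(X, serialized_rows, "concat")
X_replace = build_features(X, serialized_rows, "replace")
print(X_concat.shape, X_replace.shape)  # (3, 6) (3, 4)
```

Either feature matrix can then be passed to a downstream model. In the concatenation case the downstream model still sees the raw numeric columns, which may be one reason that variant tends to do better.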
Abstract
Embeddings are a powerful way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). Yet, there is limited evidence on how to design effective LLM-based embedding pipelines for tabular prediction. In this work, we systematically benchmark 256 pipeline configurations, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. Our results show that whether incorporating the prior knowledge of LLMs improves predictive performance depends strongly on the specific pipeline design. In general, concatenating embeddings tends to outperform replacing the original columns with embeddings. Larger embedding models tend to yield better results, while public leaderboard rankings and model popularity are poor performance indicators. Finally, gradient boosting decision trees tend to be strong downstream models. Our findings…
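The figure of 256 configurations follows from the full grid of 8 preprocessing strategies, 16 embedding models, and 2 downstream models. The short sketch below enumerates such a grid. The names in the three lists are placeholders, since the summary does not list the actual strategies or models.

```python
from itertools import product

# Placeholder names (assumption): only the counts come from the abstract.
preprocessing = [f"preprocessing_{i}" for i in range(1, 9)]        # 8 strategies
embedding_models = [f"embedding_model_{i}" for i in range(1, 17)]  # 16 models
downstream_models = ["downstream_model_1", "downstream_model_2"]   # 2 models

# Every combination of one choice per axis is one pipeline configuration.
configs = list(product(preprocessing, embedding_models, downstream_models))
print(len(configs))  # 256
```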
Taxonomy
Topics: Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI) · Topic Modeling
