TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
Minjie Qiang, Mingming Zhang, Xiaoyi Bao, Xing Fu, Yu Cheng, Weiqiang Wang, Zhongqing Wang, Ningtao Wang

TL;DR
TabEmbed is a generalist embedding model for tabular data that handles both classification and retrieval in one embedding space by capturing structural and numerical features. It is evaluated on the new TabBench benchmark.
Contribution
The paper presents TabEmbed, the first generalist embedding model for tabular data that unifies classification and retrieval in a shared embedding space, and introduces TabBench, a benchmark suite for evaluating the tabular understanding of embedding models.
Findings
TabEmbed outperforms existing text embedding models on TabBench.
Reformulating tabular tasks as semantic matching problems improves tabular understanding.
Large-scale contrastive learning with positive-aware hard negative mining enhances model performance.
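
To make the "semantic matching" framing concrete, here is a minimal sketch of classification done by comparing an embedded row against embedded class descriptions. Everything in it is an assumption for illustration: the "column is value" serialization template, the `toy_embed` stand-in (a clause-level count vector, not a learned model), and the example labels. It does not reproduce the paper's actual serialization or encoder.

```python
import math
from collections import Counter


def serialize_row(row):
    """Flatten a table row into text. The 'column is value' template is assumed."""
    return "; ".join(f"{col} is {val}" for col, val in row.items())


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts of ';'-separated clauses."""
    clauses = [c.strip().lower() for c in text.split(";")]
    return Counter(c for c in clauses if c)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(row, label_texts):
    """Pick the label whose description is most similar to the serialized row."""
    query = toy_embed(serialize_row(row))
    scores = {label: cosine(query, toy_embed(text)) for label, text in label_texts.items()}
    return max(scores, key=scores.get), scores


# Illustrative data only; these columns and labels are not from the paper.
row = {"income": "high", "debt": "low", "age": 40}
label_texts = {
    "approve": "income is high; debt is low",
    "reject": "income is low; debt is high",
}
predicted, scores = classify(row, label_texts)
print(predicted, scores)
```

The same `cosine` ranking could serve retrieval by scoring the query row against other serialized rows instead of label descriptions, which is the sense in which one embedding space can cover both task types.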
Abstract
Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and…
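
As a rough illustration of contrastive training with positive-aware hard negative mining, the sketch below filters candidate negatives relative to the positive's similarity score and then computes an InfoNCE-style loss. The abstract does not state the paper's actual filtering rule or loss, so the `margin` ceiling, `top_k`, and `temperature` values are assumptions, and the similarity scores are made-up numbers.

```python
import math


def filter_hard_negatives(pos_score, neg_scores, margin=0.95, top_k=2):
    """Keep the top_k highest-scoring negatives that stay below margin * pos_score.

    Candidates at or above that ceiling are treated as likely false negatives
    and dropped. Assumes similarity scores are positive (e.g. cosine in (0, 1]).
    """
    ceiling = margin * pos_score
    kept = [s for s in neg_scores if s < ceiling]
    return sorted(kept, reverse=True)[:top_k]


def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive over all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# Made-up similarities between one query row and its candidates.
pos = 0.80
candidates = [0.79, 0.70, 0.40, 0.10]

hard_negs = filter_hard_negatives(pos, candidates)
loss = info_nce(pos, hard_negs)
print(hard_negs, loss)
```

Here the candidate scoring 0.79 sits above the ceiling of 0.95 × 0.80 and is discarded as a probable false negative, while the two hardest remaining candidates are kept for the loss.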
