TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

Minjie Qiang; Mingming Zhang; Xiaoyi Bao; Xing Fu; Yu Cheng; Weiqiang Wang; Zhongqing Wang; Ningtao Wang

arXiv:2605.04962·cs.CL·May 7, 2026

TL;DR

TabEmbed is a unified embedding model for tabular data that supports both classification and retrieval by capturing structural and numerical features. It is evaluated on TabBench, a new benchmark introduced in the same paper.

Contribution

The paper presents TabEmbed, the first generalist embedding model for tabular data that unifies classification and retrieval in a shared space, and introduces the TabBench benchmark.

Findings

01. TabEmbed outperforms existing text embedding models on TabBench.

02. Reformulating tabular tasks as semantic matching improves understanding.

03. Contrastive learning with hard negative mining enhances model performance.
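
To make the "semantic matching" framing concrete, here is a minimal sketch of classification as embedding matching: a row is serialized to text, embedded, and assigned the label whose embedding is closest. The serialization template, the helper names (`serialize_row`, `cosine`, `classify_by_matching`), and the toy vectors are assumptions for illustration only; the listing does not describe TabEmbed's actual input format or encoder.

```python
import math

def serialize_row(row):
    """Flatten a table row (column -> value) into one text string.
    The "col is val" template is a hypothetical choice, not the paper's."""
    return "; ".join(f"{col} is {val}" for col, val in row.items())

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_by_matching(row_vec, label_vecs):
    """Return the label whose embedding is most similar to the row embedding."""
    return max(label_vecs, key=lambda name: cosine(row_vec, label_vecs[name]))

# Toy vectors stand in for the output of an embedding model (assumed, not real).
text = serialize_row({"age": 42, "income": 55000})
row_vec = [0.9, 0.1]
label_vecs = {"approve": [1.0, 0.0], "reject": [0.0, 1.0]}
prediction = classify_by_matching(row_vec, label_vecs)
```

Retrieval fits the same pattern under this framing: the candidates are embeddings of other rows or tables rather than label descriptions, and the ranking uses the same similarity score.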

Abstract

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and…
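
The abstract names contrastive learning with positive-aware hard negative mining but, as excerpted here, gives no formula. The sketch below shows one common reading, under stated assumptions: candidates scoring too close to the positive are treated as likely false negatives and dropped, the hardest remaining candidates are kept, and an InfoNCE loss is computed over them. The `margin_ratio` threshold, the temperature value, and both function names are hypothetical and are not taken from the paper.

```python
import math

def mine_hard_negatives(pos_score, cand_scores, k=2, margin_ratio=0.95):
    """Return indices of the k highest-scoring candidates whose score is
    below margin_ratio * pos_score. Candidates at or above that ceiling
    are skipped as likely false negatives (assumed filtering rule)."""
    ceiling = margin_ratio * pos_score
    kept = [(s, i) for i, s in enumerate(cand_scores) if s < ceiling]
    kept.sort(reverse=True)  # hardest (most similar) negatives first
    return [i for _, i in kept[:k]]

def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive's logit."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]

# Similarity scores between one query and five candidate negatives (toy values).
pos = 0.8
candidates = [0.79, 0.70, 0.20, 0.75, 0.50]
hard_idx = mine_hard_negatives(pos, candidates, k=2)
loss = info_nce(pos, [candidates[i] for i in hard_idx])
```

In this toy case the candidate scoring 0.79 is excluded because it sits above the ceiling of 0.76, which is the point of making the mining "positive-aware": the loss is driven by negatives that are hard but unlikely to be unlabeled positives.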

Code & Models

Repositories

qiangminjie27/TabEmbed
github

Datasets

qiangminjie27/TabBench
dataset · 297 dl
