CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval
Tsung-Hsiang Chou, Chen-Jui Yu, Shui-Hsiang Hsu, Yao-Chung Fan

TL;DR
CGPT is a training framework that fine-tunes an embedding model on LLM-generated synthetic queries written for clustering-based partial tables, improving large-scale table retrieval performance across multiple benchmarks.
Contribution
The paper proposes a new method that uses clustering and LLM-generated supervision to enhance table retrieval, outperforming existing baselines and demonstrating strong cross-domain generalization.
Findings
Achieves an average R@1 improvement of 16.54% over baselines.
Outperforms previous methods such as QGpT on four public benchmarks.
Remains effective with smaller LLMs for synthetic query generation.
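For readers unfamiliar with the R@1 figure above, the following is a minimal sketch of how Recall@k is commonly computed in retrieval evaluation. It assumes exactly one gold table per query. The function name `recall_at_k` and the toy table IDs are illustrative and do not come from the paper.

```python
def recall_at_k(ranked_ids_per_query, gold_id_per_query, k=1):
    """Fraction of queries whose gold table appears in the top-k results.

    Assumption: each query has exactly one gold (relevant) table ID.
    """
    if not gold_id_per_query:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids_per_query, gold_id_per_query)
        if gold in ranked[:k]
    )
    return hits / len(gold_id_per_query)


# Toy example: three queries, each with a ranked list of retrieved table IDs.
ranked = [["t1", "t2"], ["t3", "t1"], ["t2", "t3"]]
gold = ["t1", "t1", "t3"]
r_at_1 = recall_at_k(ranked, gold, k=1)  # only the first query hits at rank 1
r_at_2 = recall_at_k(ranked, gold, k=2)  # all three gold tables are in the top 2
```

With k=1 the metric rewards only the top-ranked table, so it is the strictest of the Recall@k variants.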
Abstract
General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query-table mismatch. Recent LLM-based retrieval augmentation methods mitigate this issue by generating synthetic queries, yet they often rely on heuristic partial-table selection and seldom leverage these synthetic queries as supervision to improve the embedding model. We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT constructs semantically diverse partial tables by clustering table instances using K-means and sampling across clusters to broaden semantic coverage. An LLM then generates synthetic queries for these partial tables, which are used in hard-negative contrastive fine-tuning to refine the embedding model. Experiments across…
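The cluster-guided partial-table step described in the abstract can be sketched roughly as follows. This is not the authors' implementation: it assumes that "table instances" are rows, that each row already has an embedding vector (toy 2-D vectors here), and it uses a plain Lloyd's K-means written in standard-library Python. The names `kmeans` and `cluster_guided_partial_table` and the `per_cluster` parameter are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_guided_partial_table(header, rows, row_vectors, k, per_cluster=1, seed=0):
    """Sample rows from every cluster so the partial table spans the clusters."""
    labels = kmeans(row_vectors, k, seed=seed)
    rng = random.Random(seed)
    picked = []
    for c in range(k):
        indices = [i for i, lab in enumerate(labels) if lab == c]
        picked.extend(rng.sample(indices, min(per_cluster, len(indices))))
    picked.sort()  # keep the original row order in the partial table
    return {"header": header, "rows": [rows[i] for i in picked]}


# Toy table: two semantically distinct groups of rows (assumed embeddings).
header = ["name", "category"]
rows = [["apple", "fruit"], ["pear", "fruit"], ["hammer", "tool"], ["saw", "tool"]]
row_vectors = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
partial = cluster_guided_partial_table(header, rows, row_vectors, k=2)
```

In the pipeline the abstract describes, a partial table like `partial` would then be passed to an LLM to generate synthetic queries, and those query-table pairs would feed the hard-negative contrastive fine-tuning of the embedding model. Neither of those steps is shown here.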
Taxonomy
Topics: Information Retrieval and Search Behavior · Handwritten Text Recognition Techniques · Topic Modeling
