Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering
Mingjie Zhao, Yunfan Zhang, Yiqun Zhang, Yiu-ming Cheung

TL;DR
This paper introduces TagCC, a novel clustering framework that integrates semantic knowledge from large language models into tabular data representations, significantly improving clustering performance.
Contribution
The paper proposes TagCC, a contrastive clustering framework that anchors statistical tabular representations to open-world textual concepts, using LLMs to distill the data's underlying semantics into textual anchors, with the aim of improving clustering quality.
Findings
TagCC outperforms existing clustering methods on benchmark datasets.
Semantic anchors improve the semantic coherence of clusters.
Contrastive learning effectively integrates semantic knowledge into tabular representations.
Abstract
Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like 'Flu' and 'Cold' are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation.…
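The idea of anchoring a tabular representation to a textual concept can be illustrated with a generic contrastive (InfoNCE-style) loss. The sketch below is not the paper's objective, which the truncated abstract does not specify. It assumes that each row already has an embedding and a matching text-anchor embedding, and the names `cosine`, `info_nce`, `rows`, and `anchors`, as well as the temperature value, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(row_embs, anchor_embs, temperature=0.1):
    """Mean InfoNCE loss where row i's positive is anchor i.

    Assumption: row_embs[i] and anchor_embs[i] describe the same sample;
    all other anchors in the batch act as negatives.
    """
    total = 0.0
    for i, row in enumerate(row_embs):
        logits = [cosine(row, a) / temperature for a in anchor_embs]
        # Numerically stable log-sum-exp over all anchors.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(row_embs)

# Toy 2-D embeddings (made-up values, for illustration only).
rows = [[1.0, 0.1], [0.1, 1.0]]
anchors = [[0.9, 0.0], [0.0, 0.8]]

aligned = info_nce(rows, anchors)           # each row paired with its own anchor
misaligned = info_nce(rows, anchors[::-1])  # pairings swapped
print(aligned < misaligned)
```

Minimizing a loss of this kind pulls each row embedding toward its own anchor and away from the others, which is one common way to inject external semantics into a learned representation.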
