Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

Zihua Yang; Xin Liao; Yiqun Zhang; Yiu-ming Cheung

arXiv:2601.01162 · cs.LG · April 7, 2026

TL;DR

This paper introduces ARISE, a method that leverages Large Language Models to incorporate external semantic knowledge, improving clustering accuracy for categorical data by bridging the semantic gap.

Contribution

ARISE is the first approach to use LLMs for semantic-aware representations in categorical data clustering, enhancing cluster quality especially with limited data.

Findings

01. ARISE improves clustering accuracy by 19-27% over existing methods.

02. Experiments on eight datasets validate the effectiveness of LLM-enhanced embeddings.

03. ARISE consistently outperforms seven baseline methods across benchmarks.

Abstract

Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct…
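
The equidistance problem described in the abstract can be seen in a few lines. The sketch below is not ARISE and is not taken from the paper or its repository; it only illustrates, under the assumption of a plain one-hot encoding with Euclidean distance, why values with no inherent ordering end up equally far apart. The attribute values (`mild`, `moderate`, `severe`) and the helper names `one_hot` and `euclidean` are hypothetical and chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def one_hot(values):
    """Map each distinct categorical value to a one-hot vector."""
    categories = sorted(set(values))
    index = {v: i for i, v in enumerate(categories)}
    return {
        v: [1.0 if i == index[v] else 0.0 for i in range(len(categories))]
        for v in categories
    }


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical attribute values, used only for illustration.
severity = ["mild", "moderate", "severe"]
encoded = one_hot(severity)

# Distance between every pair of distinct values.
pairwise = {
    (a, b): euclidean(encoded[a], encoded[b])
    for a, b in combinations(severity, 2)
}

for pair, dist in pairwise.items():
    print(pair, round(dist, 4))
```

Every pair comes out at the same distance (the square root of 2), so "mild" is no closer to "moderate" than to "severe". Any notion that adjacent severities are more alike has to come from somewhere else: co-occurrence statistics within the dataset, or, as the abstract proposes, external semantic knowledge.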

Code & Models

Repositories

develop-yang/ARISE (GitHub)
