TL;DR
This paper introduces ARISE, a method that leverages Large Language Models to incorporate external semantic knowledge, improving clustering accuracy for categorical data by bridging the semantic gap.
Contribution
ARISE is the first approach to use LLMs to build semantic-aware representations for categorical data clustering, improving cluster quality, especially when data are limited.
Findings
ARISE improves clustering accuracy by 19-27% over existing methods.
Experiments on eight datasets validate the effectiveness of LLM-enhanced embeddings.
ARISE consistently outperforms seven baseline methods across benchmarks.
Abstract
Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct…
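The "equidistant" problem described in the abstract can be seen in a small sketch. It is not ARISE itself: the attribute values and the 2-D vectors below are made up for illustration and stand in for whatever embeddings an LLM would supply. Under one-hot encoding, every pair of distinct values is the same distance apart, while an embedding-style representation can place related values closer together.

```python
import numpy as np


def one_hot_distances(values):
    """Pairwise Euclidean distances between one-hot encodings of the values."""
    eye = np.eye(len(values))
    diff = eye[:, None, :] - eye[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def cosine_distances(vectors):
    """Pairwise cosine distances (1 - cosine similarity) between row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


# Hypothetical values of a single categorical attribute.
values = ["mild", "moderate", "severe"]

# One-hot view: all distinct values are sqrt(2) apart, so ordering is lost.
one_hot = one_hot_distances(values)

# Hypothetical toy embeddings, hand-written for illustration only.
# They are NOT produced by an LLM and NOT taken from the paper.
toy_embeddings = np.array([
    [1.0, 0.1],  # mild
    [0.8, 0.6],  # moderate
    [0.1, 1.0],  # severe
])
semantic = cosine_distances(toy_embeddings)

print(np.round(one_hot, 3))
print(np.round(semantic, 3))
```

In this toy setup "mild" ends up closer to "moderate" than to "severe", a relationship the one-hot view cannot express. How ARISE actually obtains and weights its embeddings is not covered by the truncated abstract, so none of that is modeled here.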
