Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
Tunazzina Islam

TL;DR
This paper introduces a reasoning-based framework using large language models to validate and refine unsupervised text clusters, improving coherence, reducing redundancy, and generating interpretable labels without supervision.
Contribution
It presents a novel LLM-driven reasoning framework that decouples validation from representation learning, enhancing the quality and interpretability of unsupervised text clustering.
Findings
The framework improves cluster coherence and human-aligned labeling quality over classical models.
These improvements are consistent across social media corpora from different platforms.
The framework remains robust and stable when moved across platforms.
Abstract
Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels through a two-stage process that generates and consolidates semantically similar…
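The three reasoning stages described in the abstract can be pictured as a small pipeline that sits on top of any clustering output. The following is a minimal sketch, not the author's implementation: the `Cluster` record, the `Judge` interface, and the `refine` function are hypothetical names, and `ToyJudge` is a word-overlap stand-in for a real LLM call so that the example runs offline. The thresholds are arbitrary. Because the abstract is truncated at the label-consolidation step, only the label-generation half of stage (iii) is sketched.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Cluster:
    summary: str          # short description produced by the upstream clusterer
    texts: List[str]      # member documents
    label: str = ""       # filled in by the label-grounding stage


class Judge(Protocol):
    """Hypothetical interface an LLM-backed judge would implement."""
    def supports(self, summary: str, texts: List[str]) -> bool: ...
    def same_topic(self, a: str, b: str) -> bool: ...
    def label(self, summary: str, texts: List[str]) -> str: ...


def _words(s: str) -> set:
    return set(s.lower().split())


class ToyJudge:
    """Word-overlap stand-in for an LLM; an assumption for illustration only."""

    def supports(self, summary, texts):
        # A text "supports" the summary if it shares at least one word with it.
        hits = sum(1 for t in texts if _words(summary) & _words(t))
        return bool(texts) and hits / len(texts) >= 0.5

    def same_topic(self, a, b):
        # Jaccard similarity between the two summaries' word sets.
        wa, wb = _words(a), _words(b)
        return len(wa & wb) / max(len(wa | wb), 1) >= 0.5

    def label(self, summary, texts):
        return summary.title()


def refine(clusters: List[Cluster], judge: Judge) -> List[Cluster]:
    # (i) Coherence verification: drop clusters whose summary is unsupported.
    coherent = [c for c in clusters if judge.supports(c.summary, c.texts)]

    # (ii) Redundancy adjudication: merge a cluster into the first earlier
    # cluster the judge considers to cover the same topic.
    merged: List[Cluster] = []
    for c in coherent:
        for m in merged:
            if judge.same_topic(m.summary, c.summary):
                m.texts.extend(c.texts)
                break
        else:
            merged.append(Cluster(c.summary, list(c.texts)))

    # (iii) Label grounding, generation step only.
    for m in merged:
        m.label = judge.label(m.summary, m.texts)
    return merged
```

Keeping the judge behind an interface reflects the abstract's point that validation is separate from the clustering algorithm itself: `refine` never inspects embeddings, only summaries and member texts, so the same code could in principle wrap the output of any clusterer.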
