ClusterLLM: Large Language Models as a Guide for Text Clustering
Yuwei Zhang, Zihan Wang, Jingbo Shang

TL;DR
ClusterLLM introduces a novel framework that uses instruction-tuned large language models like ChatGPT to improve text clustering by leveraging feedback and user preferences, outperforming traditional methods.
Contribution
It presents a new approach that utilizes LLM feedback for clustering, enabling understanding of user preferences and improving clustering quality without access to embeddings.
Findings
Consistently improves clustering quality across 14 datasets.
Cost-effective with an average cost of ~$0.6 per dataset.
Effective for fine-tuning small embedders using ChatGPT feedback.
Abstract
We introduce ClusterLLM, a novel text clustering framework that leverages feedback from an instruction-tuned large language model, such as ChatGPT. Compared with traditional unsupervised methods that build upon "small" embedders, ClusterLLM exhibits two intriguing advantages: (1) it enjoys the emergent capability of the LLM even if its embeddings are inaccessible; and (2) it understands the user's preference on clustering through textual instruction and/or a few annotated data. First, we prompt ChatGPT for insights on clustering perspective by constructing hard triplet questions <does A better correspond to B than C>, where A, B and C are similar data points that belong to different clusters according to the small embedder. We empirically show that this strategy is both effective for fine-tuning the small embedder and cost-efficient to query ChatGPT. Second, we prompt ChatGPT for help on…
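The hard triplet questions described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the selection heuristic (anchors whose two nearest cluster centroids are almost equally close), the function names hard_triplets and triplet_prompt, the prompt wording, and the toy data are all assumptions made for illustration. Only the question form <does A better correspond to B than C> comes from the abstract.

```python
import numpy as np

def hard_triplets(emb, labels, n_triplets=1):
    """Pick (a, b, c) index triplets for ambiguous anchors (hypothetical heuristic).

    b and c are the members closest to anchor a in the two clusters whose
    centroids are nearest to a, so b and c always come from different clusters.
    """
    cluster_ids = np.unique(labels)
    centroids = np.stack([emb[labels == k].mean(axis=0) for k in cluster_ids])
    # Distance from every point to every centroid: shape (n_points, n_clusters).
    dist = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    sorted_dist = np.sort(dist, axis=1)
    # A small gap between the nearest and second-nearest centroid marks an
    # anchor that the small embedder is unsure about.
    margin = sorted_dist[:, 1] - sorted_dist[:, 0]
    triplets = []
    for a in np.argsort(margin):
        picks = []
        for k in cluster_ids[order[a, :2]]:
            members = np.flatnonzero((labels == k) & (np.arange(len(emb)) != a))
            if members.size == 0:
                break  # cluster has no candidate other than the anchor itself
            d = np.linalg.norm(emb[members] - emb[a], axis=1)
            picks.append(int(members[np.argmin(d)]))
        if len(picks) == 2:
            triplets.append((int(a), picks[0], picks[1]))
        if len(triplets) == n_triplets:
            break
    return triplets

def triplet_prompt(texts, a, b, c):
    """Render one triplet as a question for the LLM (wording is an assumption)."""
    return (
        f"Query: {texts[a]}\n"
        f"Choice 1: {texts[b]}\n"
        f"Choice 2: {texts[c]}\n"
        "Does the query better correspond to Choice 1 than to Choice 2? "
        "Answer with 'Choice 1' or 'Choice 2'."
    )

# Toy data: 2-D stand-ins for embeddings from a small embedder.
texts = ["reset my card pin", "change card pin", "card was stolen",
         "report a lost card", "my card pin is lost"]
emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.9, 0.0], [0.5, 0.05]])
labels = np.array([0, 0, 1, 1, 0])

triplets = hard_triplets(emb, labels, n_triplets=1)
prompt = triplet_prompt(texts, *triplets[0])
```

The LLM's answers to such prompts would then serve as supervision for fine-tuning the small embedder, for example with a triplet-style loss; that step is omitted here because the abstract does not specify it.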
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques
