ClusterLLM: Large Language Models as a Guide for Text Clustering

Yuwei Zhang; Zihan Wang; Jingbo Shang

arXiv:2305.14871 · cs.CL · November 7, 2023 · 2 cites

Open Access · 1 Repo

TL;DR

ClusterLLM is a text clustering framework that uses feedback from an instruction-tuned large language model, such as ChatGPT, to capture the user's clustering preferences and to improve clustering quality over traditional unsupervised methods built on small embedders.

Contribution

It uses LLM feedback to guide clustering, which lets the method reflect user preferences and improve clustering quality even when the LLM's embeddings are inaccessible.

Findings

01. Consistently improves clustering quality across 14 datasets.

02. Cost-effective with an average cost of ~$0.6 per dataset.

03. Effective for fine-tuning small embedders using ChatGPT feedback.

Abstract

We introduce ClusterLLM, a novel text clustering framework that leverages feedback from an instruction-tuned large language model, such as ChatGPT. Compared with traditional unsupervised methods that build upon "small" embedders, ClusterLLM exhibits two intriguing advantages: (1) it enjoys the emergent capability of the LLM even if its embeddings are inaccessible; and (2) it understands the user's preference on clustering through textual instructions and/or a few annotated data points. First, we prompt ChatGPT for insights on the clustering perspective by constructing hard triplet questions <does A better correspond to B than C>, where A, B and C are similar data points that belong to different clusters according to a small embedder. We empirically show that this strategy is both effective for fine-tuning the small embedder and cost-efficient for querying ChatGPT. Second, we prompt ChatGPT for help on…
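
The hard-triplet idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' sampling procedure, which the truncated abstract does not spell out. It assumes that "hard" anchors are points almost equally close to their two nearest cluster centroids, and that B and C are the anchor's nearest neighbours in those two clusters. The function names (`build_hard_triplets`, `triplet_question`) and the prompt wording are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroids_by_cluster(embeddings, labels):
    """Mean embedding of every cluster, keyed by cluster label."""
    groups = defaultdict(list)
    for vec, lab in zip(embeddings, labels):
        groups[lab].append(vec)
    return {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lab, vecs in groups.items()}


def build_hard_triplets(embeddings, labels, num_triplets):
    """Return (A, B, C) index triplets for the most ambiguous anchors.

    Assumption: ambiguity is the gap between an anchor's distances to its
    two nearest centroids (a small gap means a hard case). B and C are the
    anchor's nearest neighbours in those two different clusters.
    """
    cents = centroids_by_cluster(embeddings, labels)
    if len(cents) < 2:
        return []
    scored = []
    for i, vec in enumerate(embeddings):
        ranked = sorted(cents, key=lambda lab: euclidean(vec, cents[lab]))
        first, second = ranked[0], ranked[1]
        margin = euclidean(vec, cents[second]) - euclidean(vec, cents[first])
        scored.append((margin, i, first, second))
    scored.sort()  # smallest margin first, i.e. hardest anchors first

    def nearest_member(anchor, cluster):
        """Closest point to the anchor inside the given cluster."""
        candidates = [j for j, lab in enumerate(labels)
                      if lab == cluster and j != anchor]
        return min(candidates,
                   key=lambda j: euclidean(embeddings[anchor], embeddings[j]),
                   default=None)

    triplets = []
    for _, a, first, second in scored[:num_triplets]:
        b, c = nearest_member(a, first), nearest_member(a, second)
        if b is not None and c is not None:
            triplets.append((a, b, c))
    return triplets


def triplet_question(texts, triplet):
    """Format one triplet as a question for the LLM (hypothetical wording)."""
    a, b, c = triplet
    return (f"Query: {texts[a]}\n"
            f"Choice 1: {texts[b]}\n"
            f"Choice 2: {texts[c]}\n"
            "Does the query better correspond to Choice 1 than Choice 2?")
```

In this sketch, `embeddings` would come from the small embedder and `labels` from any clustering run on those embeddings. The LLM's answers could then serve as triplet supervision for fine-tuning the embedder, which is the use the abstract describes.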

Code & Models

Repositories

zhang-yu-wei/clusterllm
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques