Information-Theoretic Generative Clustering of Documents
Xin Du, Kumiko Tanaka-Ishii

TL;DR
This paper introduces generative clustering using large language models to cluster documents based on information-theoretic similarity, achieving state-of-the-art results and improving document retrieval accuracy.
Contribution
It proposes generative clustering, which clusters documents via LLM-generated texts, defines document similarity rigorously through the KL divergence, and introduces a clustering algorithm based on importance sampling.
Findings
GC outperforms previous clustering methods, often by a large margin.
When documents are indexed via hierarchical clustering for generative document retrieval, GC improves retrieval accuracy.
GC achieves state-of-the-art performance on benchmark datasets.
Abstract
We present generative clustering (GC) for clustering a set of documents by using texts generated by large language models (LLMs) instead of by clustering the original documents. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm by using importance sampling. We show that GC achieves the state-of-the-art performance, outperforming any previous clustering method often by a large margin. Furthermore, we show an application to generative document retrieval in which documents are indexed via hierarchical clustering and our method improves the retrieval accuracy.
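To make the KL-divergence and importance-sampling idea concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a toy set of three candidate generated texts in place of real LLM output, and the function names, the example distributions, and the choice of an equal-weight mixture as the proposal are all illustrative assumptions.

```python
import numpy as np


def exact_kl(p, q):
    """Exact KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def kl_importance_sampling(p, q, proposal, n_samples, rng):
    """Estimate KL(p || q) from texts drawn from `proposal` instead of p.

    Each sampled text y is reweighted by p(y) / proposal(y), so the
    weighted average of log p(y) - log q(y) is an unbiased estimate.
    """
    ys = rng.choice(len(proposal), size=n_samples, p=proposal)
    weights = p[ys] / proposal[ys]
    return float(np.mean(weights * (np.log(p[ys]) - np.log(q[ys]))))


# Hypothetical stand-ins for the distributions an LLM would assign to
# three candidate generated texts, conditioned on two different documents.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

# Assumed proposal: an equal-weight mixture, so one shared sample of texts
# can be reused when comparing either document against the other.
proposal = 0.5 * (p + q)

rng = np.random.default_rng(0)
estimate = kl_importance_sampling(p, q, proposal, 100_000, rng)
print("exact:", exact_kl(p, q), "estimate:", estimate)
```

With a real LLM the space of generated texts is far too large to enumerate, which is why a sampled estimate of this kind is needed in place of the exact sum.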
Taxonomy
Topics: Advanced Text Analysis Techniques · Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
