Information-Theoretic Generative Clustering of Documents

Xin Du; Kumiko Tanaka-Ishii

arXiv:2412.13534 · cs.LG · December 19, 2024

TL;DR

This paper introduces generative clustering using large language models to cluster documents based on information-theoretic similarity, achieving state-of-the-art results and improving document retrieval accuracy.

Contribution

The paper proposes generative clustering (GC), which clusters documents through LLM-generated texts, measures document similarity with the KL divergence, and performs the clustering with an algorithm based on importance sampling.

Findings

01 GC outperforms previous clustering methods, often by a large margin.

02 The approach improves the accuracy of generative document retrieval.

03 GC achieves state-of-the-art clustering performance on benchmark datasets.

Abstract

We present generative clustering (GC) for clustering a set of documents, $X$, by using texts $Y$ generated by large language models (LLMs) instead of clustering the original documents $X$ directly. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm that uses importance sampling. We show that GC achieves state-of-the-art performance, outperforming previous clustering methods, often by a large margin. Furthermore, we show an application to generative document retrieval, in which documents are indexed via hierarchical clustering, and our method improves the retrieval accuracy.
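
The abstract names two ingredients: a KL-divergence similarity between the text distributions that an LLM assigns to documents, and a clustering algorithm built on importance sampling. The sketch below shows one way these could fit together. It is an illustration under stated assumptions, not the authors' algorithm (see the kduxin/lmgc repository for that). It assumes a precomputed matrix `log_p[i, j]` holding the log-probability of sampled text $y_j$ given document $x_i$, and a vector `log_q[j]` holding the log-probability of $y_j$ under the proposal it was drawn from. The function names, the k-means-style loop, the mixture centroids, and the `init_idx` parameter are all hypothetical choices made for this example.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def kl_to_centroids(log_p, log_q, log_c):
    """Estimate KL(p_i || c_k) for every document i and centroid k.

    log_p: (n, m) log p(y_j | x_i) for n documents and m sampled texts
    log_q: (m,)   log q(y_j) under the proposal that produced the samples
    log_c: (k, m) log c_k(y_j) for k centroid distributions
    """
    # Self-normalized importance weights: w[i, j] proportional to p_i(y_j) / q(y_j).
    log_w = log_p - log_q
    log_w = log_w - logsumexp(log_w, axis=1)[:, None]
    w = np.exp(log_w)  # each row sums to 1
    # KL[i, k] ~= sum_j w[i, j] * (log p_i(y_j) - log c_k(y_j))
    return np.sum(w * log_p, axis=1)[:, None] - w @ log_c.T

def cluster(log_p, log_q, k, n_iter=20, init_idx=None, seed=0):
    """k-means-style loop with KL divergence as the distortion (assumed design)."""
    n = log_p.shape[0]
    if init_idx is None:
        init_idx = np.random.default_rng(seed).choice(n, size=k, replace=False)
    log_c = log_p[np.asarray(init_idx)].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        new_labels = np.argmin(kl_to_centroids(log_p, log_q, log_c), axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = log_p[labels == c]
            if len(members):
                # Centroid = uniform mixture of its members' distributions.
                log_c[c] = logsumexp(members, axis=0) - np.log(len(members))
    return labels
```

In practice the entries of `log_p` would come from scoring each generated text under the LLM conditioned on each document. The mixture centroid is used here because the mean distribution minimizes the summed KL divergence from the members to the centroid.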

Code & Models

Repositories

kduxin/lmgc (PyTorch, official)

Videos

Information-Theoretic Generative Clustering of Documents · Underline

Taxonomy

Topics: Advanced Text Analysis Techniques · Semantic Web and Ontologies · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training