Summaries as Centroids for Interpretable and Scalable Text Clustering
Jairo Diaz-Rodriguez

TL;DR
This paper proposes a text clustering method that uses human-readable summaries as centroids, combining interpretability with scalability, and evaluates it across multiple datasets and in streaming scenarios.
Contribution
It introduces k-NLPmeans and k-LLMmeans, k-means variants that periodically replace numeric centroids with textual summaries, improving interpretability without sacrificing accuracy.
Findings
Outperforms classical clustering baselines.
Approaches the accuracy of LLM-based clustering.
Effective for streaming text data.
Abstract
We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering, without extensive LLM calls. Finally, we provide a case study…
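The summary-as-centroid loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a toy hashed bag-of-words encoder standing in for a real embedding model, `summarize` is a hypothetical extractive summarizer that simply returns the member text nearest the cluster mean, and `summary_every` is an assumed name for how often the summary step replaces the numeric update.

```python
import math
import random
import re
import zlib

DIM = 64  # toy embedding size (assumption, not taken from the paper)


def embed(text):
    """Toy hashed bag-of-words embedding; a stand-in for a real sentence encoder."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def summarize(texts):
    """Hypothetical extractive summarizer: the member text closest to the mean embedding."""
    vecs = [embed(t) for t in texts]
    center = mean(vecs)
    best = min(range(len(texts)), key=lambda i: sq_dist(vecs[i], center))
    return texts[best]


def summary_kmeans(texts, k, iters=10, summary_every=2, seed=0):
    """k-means in embedding space; every `summary_every` iterations the centroid
    of each cluster becomes the embedding of a textual summary of its members."""
    rng = random.Random(seed)
    vecs = [embed(t) for t in texts]
    centroids = [vecs[i] for i in rng.sample(range(len(texts)), k)]
    summaries = [None] * k
    labels = [0] * len(texts)
    for it in range(1, iters + 1):
        # Assignment step: unchanged from standard k-means.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vecs]
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                continue  # empty cluster: keep the previous centroid
            if it % summary_every == 0:
                # Summary step: the textual prototype becomes the centroid.
                summaries[c] = summarize([texts[i] for i in members])
                centroids[c] = embed(summaries[c])
            else:
                # Ordinary numeric centroid update.
                centroids[c] = mean([vecs[i] for i in members])
    return labels, summaries


docs = [
    "the cat chased the mouse",
    "a cat and a dog played",
    "the dog barked at the cat",
    "stock prices fell sharply",
    "the stock market rallied",
    "investors sold stock today",
]
labels, summaries = summary_kmeans(docs, k=2)
print(labels, summaries)
```

Swapping `summarize` for a different summarizer (extractive or LLM-based) leaves the rest of the loop untouched, which is the sense in which the abstract calls k-LLMmeans a drop-in upgrade.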
Peer Reviews
Decision·ICLR 2026 Poster
- The proposal of using summaries as centroids in k-means clustering looks like a very innovative and easy-to-implement approach.
- The authors included diverse ways of computing the summary centroid besides simply querying an LLM, such as TextRank, SVD, etc., which can provide computationally cheaper alternatives.
- The approach shows good gains in performance over standard k-means clustering, demonstrated consistently across many datasets and with many different embedding models (Tables 1-4).
Not much of a weakness, but a suggestion: I would suggest elaborating on the NMI metric to give readers a brief explanation of how it is calculated.
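For readers unfamiliar with the metric, normalized mutual information (NMI) compares a predicted clustering with reference labels by dividing their mutual information by a normalizer built from the two label entropies. The sketch below assumes the arithmetic mean of the entropies as the normalizer; which normalization the paper uses is not stated here, so treat that choice as an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )


def nmi(true_labels, pred_labels):
    """NMI with arithmetic-mean normalization (an assumed choice)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    denom = (entropy(true_labels) + entropy(pred_labels)) / 2
    if denom == 0:
        return 1.0  # both labelings put everything in one cluster; treated here as a match
    return mutual_information(true_labels, pred_labels) / denom


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the score depends only on how items are grouped, renaming cluster ids does not change it; a score near 1 means the clustering matches the reference partition, and a score near 0 means the two are statistically independent.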
In this manuscript, the authors aim to address the problems of poor interpretability of numeric centroids and high scalability costs in traditional text clustering methods. Specifically, the proposed k-NLPmeans uses lightweight and deterministic classical NLP summarizers to periodically replace numeric centroids with textual summaries. The proposed k-LLMmeans leverages LLMs for summaries under a fixed per-iteration budget. Experimental results across diverse datasets and embedding models show th
There are some concerns about the manuscript: 1. How was k set in the experiments? The influence of k on the experimental results is not discussed. 2. The example in Figure 1 is based on the results of k-means. However, the proposed method introduces a summarization step that computes a textual prototype in place of the standard centroid update. How, then, does the proposed method guarantee that the instances in the same cluster can be used to generate promising su
- **Simple but novel idea:** The notion of introducing interpretable textual centroids inside the k-means loop is elegant, practical, and original. It creates a direct, auditable link between cluster means and human-interpretable summaries.
- **Interpretability without post-hoc processing:** Unlike topic models or LLM-based pipelines that only label clusters afterward, the prototype is the cluster, which is useful for debugging, transparency, and downstream analyst workflows.
- **Low-resource ap
- **Summarization hints:** Performance depends on the summarizer, especially in heterogeneous clusters. The paper tests several summarizers but does not provide guidance on when one strategy is preferable (e.g., extractive vs. abstractive by dataset characteristics).
- **Missing comparison on interpretability:** Interpretability is a key selling point, but comparisons are mostly against centroid-based clustering. Topic-model-style baselines (e.g., BERTopic) would give a fairer interpretability
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
