HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization
Gabor Petnehazi, Bernadett Aradi

TL;DR
HERCULES is a hierarchical k-means clustering method that uses LLMs to generate interpretable titles and descriptions for clusters of text, image, or numeric data (one modality per run), making complex datasets easier to understand.
Contribution
The paper introduces HERCULES, a novel hierarchical clustering algorithm that integrates LLM-generated summaries to improve interpretability of clusters in diverse data types.
Findings
Effective clustering of text, images, and numeric data.
LLM-generated summaries significantly improve cluster interpretability.
Interactive visualization aids in data analysis and understanding.
Abstract
The explosive growth of complex datasets across various modalities necessitates advanced analytical tools that not only group data effectively but also provide human-understandable insights into the discovered structures. We introduce HERCULES (Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization), a novel algorithm and Python package designed for hierarchical k-means clustering of diverse data types, including text, images, and numeric data (processed one modality per run). HERCULES constructs a cluster hierarchy by recursively applying k-means clustering, starting from individual data points at level 0. A key innovation is its deep integration of Large Language Models (LLMs) to generate semantically rich titles and descriptions for clusters at each level of the hierarchy, significantly enhancing interpretability. The algorithm supports two main…
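The recursive construction described in the abstract can be illustrated with a minimal sketch. This is not the HERCULES package or its API: `kmeans`, `build_hierarchy`, `level_sizes`, and `summarize` are hypothetical names chosen for illustration, the per-level cluster counts are supplied by hand, and the LLM call is replaced by a stub that only counts members. It assumes the inputs are already numeric embeddings and that each level is formed by running k-means on the centroids of the level below.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def build_hierarchy(points, level_sizes, summarize):
    """Cluster the points, then cluster the resulting centroids, level by level."""
    levels = []
    current = [list(p) for p in points]  # level 0: the individual data points
    texts = [f"point {i}" for i in range(len(points))]
    for depth, k in enumerate(level_sizes, start=1):
        k = min(k, len(current))
        centroids, labels = kmeans(current, k)
        # children[c] lists the indices of the lower-level items in cluster c.
        children = [[i for i, l in enumerate(labels) if l == c] for c in range(k)]
        # Hypothetical stand-in for the LLM: summarize the members' texts.
        summaries = [summarize([texts[i] for i in kids]) for kids in children]
        levels.append({"level": depth, "children": children, "summaries": summaries})
        current, texts = centroids, summaries
    return levels

def stub_summarize(member_texts):
    # Assumption: a real system would prompt an LLM with the member texts here.
    return f"cluster of {len(member_texts)} items"

if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    for level in build_hierarchy(data, level_sizes=[2, 1], summarize=stub_summarize):
        print(level)
```

Passing the summarizer in as a callable keeps the clustering logic independent of any particular LLM provider; higher levels summarize the summaries of their children rather than the raw points.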
Taxonomy
Topics: Topic Modeling · Data Mining Algorithms and Applications · Advanced Text Analysis Techniques
Methods: k-Means Clustering
