Moving Past Single Metrics: Exploring Short-Text Clustering Across Multiple Resolutions
Justin Miller, Tristram Alexander

TL;DR
This paper introduces a systematic approach for determining the number of clusters in short-text clustering tasks by analyzing cluster stability across multiple resolutions, using a new metric and visualization tools.
Contribution
It presents a novel method for assessing cluster robustness and selecting cluster numbers in short-text clustering, inspired by bioinformatics techniques.
Findings
The proportional stability metric reveals how stable individual clusters remain across resolutions.
Sankey diagrams provide intuitive visualization of cluster evolution.
Multiple resolutions offer richer insights than single-metric approaches.
Abstract
Cluster number is typically a parameter selected at the outset in clustering problems, and while impactful, the choice can often be difficult to justify. Inspired by bioinformatics, this study examines how the nature of clusters varies with cluster number, presenting a method for determining cluster robustness, and providing a systematic method for deciding on the cluster number. The study focuses specifically on short-text clustering, involving 30,000 political Twitter bios, where the sparse co-occurrence of words between texts makes finding meaningful clusters challenging. A metric of proportional stability is introduced to uncover the stability of specific clusters between cluster resolutions, and the results are visualised using Sankey diagrams to provide an interrogative tool for understanding the nature of the dataset. The visualisation provides an intuitive way to track cluster…
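The abstract as shown here does not define proportional stability, so the following is only a minimal sketch under an assumed definition: for each cluster at one resolution, the largest fraction of its members that end up together in a single cluster at the next resolution. The function names (`transition_counts`, `proportional_stability`, `sankey_links`) and the toy labels are hypothetical and are not taken from the paper. The same transition counts can also serve as the link weights of a Sankey diagram.

```python
from collections import Counter, defaultdict


def transition_counts(labels_a, labels_b):
    """Count how many items move from each cluster in labels_a
    to each cluster in labels_b (same items, same order)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same items")
    flows = defaultdict(Counter)
    for a, b in zip(labels_a, labels_b):
        flows[a][b] += 1
    return flows


def proportional_stability(labels_a, labels_b):
    """ASSUMED definition, not the paper's: for each source cluster,
    return (dominant target cluster, share of members that go there)."""
    result = {}
    for source, targets in transition_counts(labels_a, labels_b).items():
        size = sum(targets.values())
        target, count = targets.most_common(1)[0]
        result[source] = (target, count / size)
    return result


def sankey_links(labels_a, labels_b):
    """Flatten the transition counts into (source, target, count)
    triples, the usual input shape for a Sankey diagram."""
    flows = transition_counts(labels_a, labels_b)
    return sorted(
        (source, target, count)
        for source, targets in flows.items()
        for target, count in targets.items()
    )


# Toy data: 8 texts clustered at two resolutions (2 and 3 clusters).
coarse = [0, 0, 0, 0, 1, 1, 1, 1]
fine = [0, 0, 0, 2, 1, 1, 2, 2]

stability = proportional_stability(coarse, fine)
links = sankey_links(coarse, fine)
```

Under this assumed definition, a value near 1 means a cluster survives almost intact at the next resolution, while a value near 1/2 or lower suggests it is splitting. In the toy data, cluster 0 keeps three of its four members together, whereas cluster 1 splits evenly.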
Taxonomy
Topics: Natural Language Processing Techniques · Computational and Text Analysis Methods
