How Many Topics? Stability Analysis for Topic Models
Derek Greene, Derek O'Callaghan, Pádraig Cunningham

TL;DR
This paper introduces a term-centric stability analysis method for selecting the number of topics in a topic model, on the premise that a model with an appropriate number of topics is more robust to perturbations in the data.
Contribution
It proposes a novel term-centric stability analysis strategy for selecting the appropriate number of topics in topic models based on matrix factorization.
Findings
The stability strategy effectively guides topic number selection.
The method uses robustness to perturbations in the data as the signal for an appropriate number of topics, and is evaluated on a range of corpora.
Empirical results demonstrate better model quality with the proposed approach.
Abstract
Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora…
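To make the idea of term-centric stability concrete, here is a minimal sketch in Python. The abstract is truncated before it states the exact agreement measure, so the choices below are assumptions for illustration: topics are compared by the Jaccard overlap of their top-term prefixes, topics from two runs are paired by brute-force matching, and all function names (jaccard, average_jaccard, agreement, stability) are hypothetical. The input is assumed to be the top-term lists of a reference model fit on the full corpus and of models fit with the same number of topics on random subsamples of the documents.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def average_jaccard(ranked_a, ranked_b):
    """Mean Jaccard similarity over the top-d prefixes of two ranked term lists.

    Averaging over depths d = 1..t rewards agreement near the top of the lists.
    """
    depth = min(len(ranked_a), len(ranked_b))
    if depth == 0:
        return 0.0
    total = sum(jaccard(ranked_a[:d], ranked_b[:d]) for d in range(1, depth + 1))
    return total / depth


def agreement(topics_a, topics_b):
    """Score the best one-to-one pairing of topics between two models.

    Brute force over all permutations, so only practical for a handful of
    topics (assumption made to keep the sketch dependency-free).
    """
    k = len(topics_a)
    if k != len(topics_b):
        raise ValueError("both models must have the same number of topics")
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(average_jaccard(topics_a[i], topics_b[j])
                    for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


def stability(reference_topics, sample_runs):
    """Mean agreement between the reference model and each subsample model."""
    return sum(agreement(reference_topics, run) for run in sample_runs) / len(sample_runs)


if __name__ == "__main__":
    # Toy top-term lists (made up for illustration, not from the paper).
    reference = [["goal", "match", "team"], ["vote", "party", "election"]]
    runs = [
        [["party", "vote", "election"], ["match", "goal", "team"]],
        [["goal", "team", "match"], ["vote", "election", "party"]],
    ]
    print(round(stability(reference, runs), 3))
```

Under these assumptions, one would compute this score for each candidate number of topics and favor values where the score is high. For more than a few topics, the permutation search should be replaced with an assignment solver such as scipy.optimize.linear_sum_assignment.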
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Expert finding and Q&A systems
