Topeax -- An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance
Márton Kardos

TL;DR
Topeax is a novel clustering topic model that improves cluster detection and term importance estimation by using density peaks and combining lexical-semantic indices, outperforming existing models like Top2Vec and BERTopic.
Contribution
Introduces Topeax, a new clustering topic model that enhances cluster detection and term importance estimation with density peak detection and combined lexical-semantic measures.
Findings
Better cluster recovery than Top2Vec and BERTopic
More robust to sample size and hyperparameter variations
Produces more coherent and trustworthy topics
Abstract
Text clustering is today the most popular paradigm for topic modelling, both in academia and industry. Despite clustering topic models' apparent success, we identify a number of issues in Top2Vec and BERTopic, which remain largely unsolved. Firstly, these approaches are unreliable at discovering natural clusters in corpora, due to extreme sensitivity to sample size and hyperparameters, the default values of which result in suboptimal behaviour. Secondly, when estimating term importance, BERTopic ignores the semantic distance of keywords to topic vectors, while Top2Vec ignores word counts in the corpus. This results in, on the one hand, less coherent topics due to the presence of stop words and junk words, and lack of variety and trust on the other. In this paper, I introduce a new approach, Topeax, which discovers the number of clusters from peaks in density estimates, and…
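The two ideas named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the 1-D Gaussian kernel density estimate, the fixed bandwidth, the geometric-mean combination, and the function names count_density_peaks and combined_importance are all assumptions chosen for illustration, and the data are made up.

```python
import numpy as np

def count_density_peaks(points, bandwidth, grid):
    """Count strict local maxima of a 1-D Gaussian kernel density estimate.

    Assumption: the number of peaks is read as the number of clusters.
    """
    points = np.asarray(points, dtype=float)
    # Scaled distance from every grid position to every data point.
    z = (grid[:, None] - points[None, :]) / bandwidth
    # Unnormalised density; normalisation does not move the peaks.
    density = np.exp(-0.5 * z**2).sum(axis=1)
    interior = density[1:-1]
    is_peak = (interior > density[:-2]) & (interior > density[2:])
    return int(is_peak.sum())

def combined_importance(counts_in_topic, counts_in_corpus, term_vectors, topic_vector):
    """Geometric mean of a count-based score and a semantic score per term.

    Hypothetical combination rule, used here only to show how word counts
    and distance to the topic vector can both enter a term's score.
    """
    # Lexical part: share of a term's occurrences that fall inside the topic.
    lexical = counts_in_topic / counts_in_corpus
    # Semantic part: cosine similarity to the topic vector, clipped at zero.
    norms = np.linalg.norm(term_vectors, axis=1) * np.linalg.norm(topic_vector)
    semantic = np.clip(term_vectors @ topic_vector / norms, 0.0, None)
    return np.sqrt(lexical * semantic)

# Toy data: two well-separated groups of points on a line.
points = np.array([-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2])
grid = np.linspace(-2.0, 7.0, 181)
n_peaks = count_density_peaks(points, bandwidth=0.5, grid=grid)

# Toy vocabulary: a stop word, an on-topic word, an off-topic word.
terms = ["the", "galaxy", "banana"]
counts_in_topic = np.array([50.0, 8.0, 1.0])
counts_in_corpus = np.array([500.0, 10.0, 10.0])
term_vectors = np.array([[0.3, 0.9], [0.9, 0.1], [-1.0, 0.2]])
topic_vector = np.array([1.0, 0.0])
scores = combined_importance(counts_in_topic, counts_in_corpus, term_vectors, topic_vector)

print("peaks found:", n_peaks)
print("top term:", terms[int(np.argmax(scores))])
```

In this toy setup the stop word is frequent but spread across the whole corpus, so its lexical share is low, while the off-topic word points away from the topic vector, so its semantic score is clipped to zero. A score built from only one of the two signals could not penalise both cases.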
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Computational and Text Analysis Methods
