MPTopic: Improving topic modeling via Masked Permuted pre-training
Xinche Zhang, Evangelos Milios

TL;DR
This paper introduces MPTopic, a novel topic modeling approach that leverages TF-RDF for improved clustering, resulting in more accurate and meaningful topic keywords compared to existing methods like BERTopic and Top2Vec.
Contribution
The paper proposes TF-RDF and MPTopic, a new clustering-based topic modeling framework that enhances topic quality by better assessing term relevance within documents.
Findings
MPTopic with TF-RDF outperforms BERTopic and Top2Vec in keyword quality.
The new method improves clustering accuracy in topic modeling.
Experimental results demonstrate superior relevance of identified topics.
Abstract
Topic modeling is pivotal in discerning hidden semantic structures within texts, thereby generating meaningful descriptive keywords. While innovative techniques like BERTopic and Top2Vec have recently emerged at the forefront, they manifest certain limitations. Our analysis indicates that these methods might not prioritize the refinement of their clustering mechanism, potentially compromising the quality of derived topic clusters. To illustrate, Top2Vec designates the centroids of clustering results to represent topics, whereas BERTopic harnesses C-TF-IDF for its topic extraction. In response to these challenges, we introduce "TF-RDF" (Term Frequency - Relative Document Frequency), a distinctive approach to assess the relevance of terms within a document. Building on the strengths of TF-RDF, we present MPTopic, a clustering algorithm intrinsically driven by the insights of TF-RDF.…
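For context on the BERTopic baseline named in the abstract, here is a minimal sketch of class-based TF-IDF (C-TF-IDF) in plain Python. It follows the commonly documented form, score = tf(t, c) * log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the frequency of term t across all clusters. The function names `c_tf_idf` and `top_keywords` are hypothetical, raw counts are used for tf (no normalization), and this is not the authors' code. The TF-RDF weighting is not defined in this summary, so it is not reproduced here.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster with a C-TF-IDF-style weighting.

    clusters: dict mapping a cluster id to the list of tokens obtained by
    concatenating every document assigned to that cluster.
    Returns a dict: cluster id -> {term: score}.
    """
    # Term counts inside each cluster (tf, left unnormalized in this sketch).
    tf = {cid: Counter(tokens) for cid, tokens in clusters.items()}

    # f(t): frequency of each term across all clusters.
    global_freq = Counter()
    for counts in tf.values():
        global_freq.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(len(tokens) for tokens in clusters.values()) / len(clusters)

    return {
        cid: {
            term: count * math.log(1 + avg_words / global_freq[term])
            for term, count in counts.items()
        }
        for cid, counts in tf.items()
    }


def top_keywords(scores, k=3):
    """Return the k highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        cid: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for cid, s in scores.items()
    }
```

Terms that are frequent inside one cluster but rare across the others receive the highest scores, which is how this kind of weighting yields per-topic keywords once the clustering is fixed.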
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Text and Document Classification Technologies
