SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-negative Matrix Factorization with Automatic Model Selection
Maksim E. Eren, Nick Solovyev, Manish Bhattarai, Kim Rasmussen, Charles Nicholas, Boian S. Alexandrov

TL;DR
SeNMFk-SPLIT is a scalable distributed approach for semantic topic modeling of large text corpora, improving upon previous NMF-based methods by enabling joint factorization of large matrices for comprehensive analysis.
Contribution
It introduces SeNMFk-SPLIT, a novel distributed method that makes large-scale semantic topic modeling feasible by factorizing the matrices separately, which suits extensive datasets such as the arXiv AI/ML literature.
Findings
Successfully applied to entire arXiv AI/ML literature
Enables joint factorization of large matrices
Improves scalability of semantic NMF methods
Abstract
As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, the values of which represent the number of times two words co-occur in a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large…
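The co-occurrence/word-context matrix described in the abstract can be illustrated with a minimal sketch. It counts how often two words fall within a fixed window of each other in a token sequence. The function name `cooccurrence_counts`, the window size of 2, the whitespace tokenization, and the sample sentence are all assumptions made for illustration; they are not taken from the paper, and the authors' implementation may differ.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count unordered word pairs that appear within `window` positions.

    Hypothetical helper for illustration only, not the paper's code.
    """
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only ahead so each pair of positions is counted once.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            # Sort the pair so (a, b) and (b, a) share one key.
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts


# Toy document; real corpora would need proper tokenization and cleaning.
tokens = "topic modeling with matrix factorization for topic extraction".split()
counts = cooccurrence_counts(tokens, window=2)
```

In a full pipeline these pair counts would fill a symmetric vocabulary-by-vocabulary matrix, which is the matrix the abstract says is factorized together with the TF-IDF matrix.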
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Advanced Text Analysis Techniques
