Sparse topic modeling via spectral decomposition and thresholding
Huy Tran, Yating Liu, Claire Donnat

TL;DR
This paper introduces a new spectral thresholding method for sparse topic modeling that effectively estimates the topic-word matrix even in high-dimensional settings and with non-separable data, outperforming existing methods.
Contribution
The paper proposes a novel spectral procedure incorporating thresholding based on word frequencies, addressing high-dimensional and non-separable data scenarios in topic modeling.
Findings
Error rate depends logarithmically on vocabulary size
Method is computationally fast and consistent across various regimes
Performs well on research papers, single-cell, and microbiome data
Abstract
The probabilistic Latent Semantic Indexing model assumes that the expectation of the corpus matrix is low-rank and can be written as the product of a topic-word matrix and a word-document matrix. In this paper, we study the estimation of the topic-word matrix under the additional assumption that the ordered entries of its columns rapidly decay to zero. This sparsity assumption is motivated by the empirical observation that the word frequencies in a text often adhere to Zipf's law. We introduce a new spectral procedure for estimating the topic-word matrix that thresholds words based on their corpus frequencies, and show that its ℓ1-error rate under our sparsity assumption depends on the vocabulary size p only via a logarithmic term. Our error bound is valid for all parameter regimes and in particular for the setting where p is extremely large; this high-dimensional setting is…
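To make the two ingredients named in the abstract concrete (thresholding words by corpus frequency, then a spectral step on the retained words), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' estimator: the threshold form `c * sqrt(log(max(p, n)) / (n * N))`, the constant `c`, the simulation settings, and all function names are hypothetical choices made for this example, and the sketch stops at the singular vectors rather than recovering the topic-word matrix.

```python
import numpy as np

def simulate_corpus(p, K, n, N, rng):
    """Draw a p x n matrix of word frequencies under a pLSI-style model."""
    # Topic-word matrix A (p x K): each column is a Zipf-like distribution
    # over words, with a different random word ordering per topic.
    zipf = 1.0 / np.arange(1, p + 1)
    A = np.column_stack([rng.permutation(zipf) for _ in range(K)])
    A /= A.sum(axis=0, keepdims=True)
    # Mixing weights W (K x n): each column lies on the simplex.
    W = rng.dirichlet(np.ones(K), size=n).T
    M = A @ W  # expected word frequencies, rank at most K
    # Each document consists of N words drawn multinomially from its column.
    counts = np.column_stack([rng.multinomial(N, M[:, j]) for j in range(n)])
    return counts / N, A

def threshold_words(D, N, c=0.1):
    """Indices of words whose average corpus frequency exceeds a threshold.

    The threshold form and the constant c are assumptions for illustration.
    """
    p, n = D.shape
    freq = D.mean(axis=1)
    tau = c * np.sqrt(np.log(max(p, n)) / (n * N))
    return np.flatnonzero(freq > tau)

def top_left_singular_vectors(D, keep, K):
    """Leading K left singular vectors of the corpus restricted to kept words."""
    U, s, _ = np.linalg.svd(D[keep], full_matrices=False)
    return U[:, :K], s[:K]

# Hypothetical problem sizes: large vocabulary, few topics.
rng = np.random.default_rng(0)
p, K, n, N = 2000, 3, 200, 100
D, A = simulate_corpus(p, K, n, N, rng)
keep = threshold_words(D, N)
U, s = top_left_singular_vectors(D, keep, K)
print(f"kept {keep.size} of {p} words; leading singular values: {s}")
```

The point of the sketch is that under Zipf-like decay most words are too rare to estimate reliably, so the spectral step runs on a much smaller matrix than the full vocabulary would give.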
Taxonomy
Topics: Topic Modeling · Gene expression and cancer classification · Text and Document Classification Technologies
