Sparse topic modeling via spectral decomposition and thresholding

Huy Tran; Yating Liu; Claire Donnat

arXiv:2310.06730·stat.ME·October 11, 2023

TL;DR

This paper introduces a new spectral thresholding method for sparse topic modeling that effectively estimates the topic-word matrix even in high-dimensional settings and with non-separable data, outperforming existing methods.

Contribution

The paper proposes a novel spectral procedure incorporating thresholding based on word frequencies, addressing high-dimensional and non-separable data scenarios in topic modeling.

Findings

01. Under the sparsity assumption, the $ℓ_{1}$-error rate depends on the vocabulary size $p$ only through a logarithmic term

02. The method is computationally fast and consistent across various regimes

03. The method performs well on research-paper, single-cell, and microbiome data

Abstract

The probabilistic Latent Semantic Indexing model assumes that the expectation of the corpus matrix is low-rank and can be written as the product of a topic-word matrix and a word-document matrix. In this paper, we study the estimation of the topic-word matrix under the additional assumption that the ordered entries of its columns rapidly decay to zero. This sparsity assumption is motivated by the empirical observation that the word frequencies in a text often adhere to Zipf's law. We introduce a new spectral procedure for estimating the topic-word matrix that thresholds words based on their corpus frequencies, and show that its $ℓ_{1}$-error rate under our sparsity assumption depends on the vocabulary size $p$ only via a logarithmic term. Our error bound is valid for all parameter regimes and in particular for the setting where $p$ is extremely large; this high-dimensional setting is…
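
As a rough illustration of the setup described in the abstract, the sketch below simulates a corpus under a pLSI-style low-rank model with Zipf-like topic columns, drops words whose corpus frequency falls below a threshold, and computes the leading singular subspace of the remaining rows. It is not the authors' procedure. The function names (`simulate_plsi_corpus`, `threshold_words`, `leading_subspace`), the symbols `A`, `W`, and `tau`, and the threshold value itself are assumptions made for illustration. The step that converts singular vectors into a topic-word estimate is left out because the abstract does not describe it.

```python
import numpy as np


def simulate_plsi_corpus(p=2000, n=200, K=3, doc_len=500, seed=0):
    """Draw a p x n matrix of observed word frequencies from a low-rank model.

    A (p x K) holds per-topic word probabilities and W (K x n) holds
    per-document topic weights; both names are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, p + 1)
    A = np.empty((p, K))
    for k in range(K):
        # Zipf-like column: sorted entries decay as 1/rank, with the ranks
        # assigned to words in a different random order for each topic.
        col = 1.0 / rng.permutation(ranks)
        A[:, k] = col / col.sum()
    W = rng.dirichlet(np.ones(K), size=n).T  # each column sums to 1
    M = A @ W  # expected word frequencies, rank at most K
    counts = np.stack(
        [rng.multinomial(doc_len, M[:, i]) for i in range(n)], axis=1
    )
    return counts / doc_len, A, W


def threshold_words(X, tau):
    """Return indices of words whose mean corpus frequency exceeds tau."""
    return np.flatnonzero(X.mean(axis=1) > tau)


def leading_subspace(X, keep, K):
    """Top-K left singular vectors and singular values of the kept rows."""
    U, s, _ = np.linalg.svd(X[keep], full_matrices=False)
    return U[:, :K], s[:K]


if __name__ == "__main__":
    p, n, K, doc_len = 2000, 200, 3, 500
    X, A, W = simulate_plsi_corpus(p, n, K, doc_len)
    # Hypothetical threshold chosen only for this demo, not taken from the paper.
    tau = np.sqrt(np.log(p) / (n * doc_len))
    keep = threshold_words(X, tau)
    U, s = leading_subspace(X, keep, K)
    print(f"kept {keep.size} of {p} words; top singular values: {s}")
```

Because the simulated columns decay quickly, only a small fraction of the vocabulary survives the threshold, so the decomposition runs on a matrix far smaller than the full `p x n` corpus.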

Code & Models

Repositories

yatingliu2548/topic-modeling (Official)

Taxonomy

Topics: Topic Modeling · Gene expression and cancer classification · Text and Document Classification Technologies