Likelihood estimation of sparse topic distributions in topic models and its applications to Wasserstein document distance calculations
Xin Bing, Florentina Bunea, Seth Strimas-Mackey, Marten Wegkamp

TL;DR
This paper develops sharp convergence rates for estimating sparse topic distributions in high-dimensional topic models, demonstrating the exact sparsity recovery of maximum likelihood estimators and applying these results to Wasserstein distance calculations.
Contribution
It provides novel finite-sample $\ell_1$-norm convergence rates for topic proportion estimators, showing that the MLE can recover the true sparsity pattern without regularization and remains optimal when $A$ is unknown.
Findings
The MLE can exactly recover the true zero pattern of the topic proportions.
The proposed estimators are minimax optimal and adaptive to unknown sparsity.
Application to Wasserstein distances enables new probabilistic document comparisons.
Abstract
This paper studies the estimation of high-dimensional, discrete, possibly sparse, mixture models in topic models. The data consists of observed multinomial counts of $p$ words across $n$ independent documents. In topic models, the $p \times n$ expected word frequency matrix is assumed to be factorized as a $p \times K$ word-topic matrix $A$ and a $K \times n$ topic-document matrix $T$. Since columns of both matrices represent conditional probabilities belonging to probability simplices, columns of $A$ are viewed as $p$-dimensional mixture components that are common to all documents while columns of $T$ are viewed as the $K$-dimensional mixture weights that are document specific and are allowed to be sparse. The main interest is to provide sharp, finite sample, $\ell_1$-norm convergence rates for estimators of the mixture weights $T$ when $A$ is either known or unknown. For known $A$, we…
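To make the known-$A$ setting concrete, here is a minimal Python sketch of maximum likelihood estimation of one document's topic proportions. It is an illustration under stated assumptions, not the authors' implementation: the function name `mle_topic_weights`, the simulated dimensions, and the use of the standard EM fixed-point update for mixture weights are all choices made here. The update starts from a strictly positive vector, so its iterates only approach zero entries in the limit; reading off a zero pattern from them would need a small threshold.

```python
import numpy as np

def mle_topic_weights(x, A, n_iter=500, tol=1e-10):
    """EM fixed-point iteration for the mixture-weight MLE with known A.

    x : (p,) word counts of a single document (hypothetical input name)
    A : (p, K) word-topic matrix, each column on the probability simplex
    Returns a (K,) vector of estimated topic proportions.
    """
    p, K = A.shape
    freq = x / x.sum()                 # observed word frequencies
    t = np.full(K, 1.0 / K)            # uniform starting point
    for _ in range(n_iter):
        mix = A @ t                    # fitted word probabilities under t
        # freq / mix, with 0 wherever the fitted probability is 0
        ratio = np.divide(freq, mix, out=np.zeros_like(freq), where=mix > 0)
        t_new = t * (A.T @ ratio)      # multiplicative EM update
        converged = np.abs(t_new - t).sum() < tol
        t = t_new
        if converged:
            break
    return t

# Simulated example (all sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
p, K, N = 50, 4, 5000
A = rng.dirichlet(np.ones(p), size=K).T     # (p, K), columns sum to 1
t_true = np.array([0.7, 0.3, 0.0, 0.0])     # sparse topic proportions
x = rng.multinomial(N, A @ t_true)          # multinomial word counts
t_hat = mle_topic_weights(x, A)
l1_error = np.abs(t_hat - t_true).sum()     # error in the l1 norm
```

Each update keeps the estimate on the simplex, and `l1_error` is the quantity whose finite-sample rate the abstract refers to.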
Taxonomy
Topics: Bayesian Methods and Mixture Models · Topic Modeling · Statistical Methods and Inference
