On Smoothing and Inference for Topic Models
Arthur Asuncion, Max Welling, Padhraic Smyth, Yee Whye Teh

TL;DR
This paper compares learning algorithms for topic models, finds that their performance differences stem mainly from how much smoothing each applies to the counts, and shows that computationally efficient methods can quickly learn accurate models on large text datasets.
Contribution
The paper provides a detailed empirical comparison of topic modeling algorithms (collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation) and highlights how smoothing and hyperparameter optimization affect their performance.
Findings
Performance differences among the algorithms are mainly attributable to the amount of smoothing applied to the counts.
When the hyperparameters are optimized, these performance differences diminish significantly.
Accurate topic models can be learned in seconds on large corpora.
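To make the smoothing finding concrete, here is a minimal sketch of generic additive smoothing applied to the word counts of a single topic. It is an illustration under stated assumptions, not the paper's implementation: the function name `smoothed_word_probs`, the toy counts, and the two pseudo-count values are all hypothetical and chosen only to show how the amount of smoothing changes the estimated probabilities.

```python
def smoothed_word_probs(counts, pseudo_count):
    """Estimate word probabilities for one topic with additive smoothing.

    counts       -- observed word counts for the topic (one entry per vocabulary word)
    pseudo_count -- non-negative amount added to every count (hypothetical parameter)
    """
    if pseudo_count < 0:
        raise ValueError("pseudo_count must be non-negative")
    total = sum(counts) + pseudo_count * len(counts)
    if total <= 0:
        raise ValueError("counts and pseudo_count cannot all be zero")
    return [(c + pseudo_count) / total for c in counts]


# Toy counts for a four-word vocabulary (illustrative values only).
counts = [8, 2, 0, 0]

light = smoothed_word_probs(counts, 0.01)  # little smoothing
heavy = smoothed_word_probs(counts, 1.0)   # more smoothing

for name, probs in (("light", light), ("heavy", heavy)):
    print(name, [round(p, 4) for p in probs])
```

With more smoothing, unseen words receive noticeably more probability mass and frequent words receive less, so two estimators fed identical counts can disagree purely because of the pseudo-count they add.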
Abstract
Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can…
Taxonomy
Topics: Bayesian Methods and Mixture Models · Topic Modeling · Music and Audio Processing
