A provable SVD-based algorithm for learning topics in dominant admixture corpus
Trapit Bansal, Chiranjib Bhattacharyya, Ravindran Kannan

TL;DR
This paper introduces a simple SVD-based algorithm with thresholding for learning topics in dominant admixture corpora, under realistic assumptions verified empirically, achieving provable accuracy and improved sample complexity.
Contribution
It proposes a novel, provable SVD-based method for topic inference under the realistic catchwords assumption, with near-optimal sample complexity and empirical validation.
Findings
Algorithm outperforms state-of-the-art methods on real data
Provably recovers topics with bounded $l_1$ error
Sample complexity is near optimal with respect to $w_0$
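To make the $l_1$ error criterion in the findings concrete, here is a minimal sketch of how one might measure it between a true and an estimated topic matrix. It is not the paper's code. The function name `l1_topic_error`, the matrices `M_true` and `M_est`, and the choice to report the worst topic after trying every column permutation are all assumptions made for this illustration.

```python
import itertools

import numpy as np


def l1_topic_error(M_true, M_est):
    """Worst-case l1 distance between matched topic columns.

    Each column is a topic (a probability vector over words). Estimated
    topics come back in arbitrary order, so every column permutation is
    tried and the best one is kept. Brute force: only sensible for small k.
    """
    k = M_true.shape[1]
    best = np.inf
    for perm in itertools.permutations(range(k)):
        # l1 distance per topic under this matching, then the worst topic
        per_topic = np.abs(M_true - M_est[:, list(perm)]).sum(axis=0)
        best = min(best, per_topic.max())
    return best


# Hypothetical toy data: 3 words, 2 topics, estimated columns swapped.
M_true = np.array([[0.7, 0.1],
                   [0.2, 0.1],
                   [0.1, 0.8]])
M_est = np.array([[0.1, 0.6],
                  [0.1, 0.3],
                  [0.8, 0.1]])

print(l1_topic_error(M_true, M_est))
```

On this toy input the best matching swaps the two estimated columns, leaving an error of about 0.2 on the first topic and 0 on the second.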
Abstract
Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from admixtures is NP-hard. Assuming separability, a strong assumption, [4] gave the first provable algorithm for inference. For the LDA model, [6] gave a provable algorithm using tensor methods. But [4,6] do not learn topic vectors with bounded $l_1$ error (a natural measure for probability vectors). Our aim is to develop a model which makes intuitive and empirically supported assumptions and to design an algorithm with natural, simple components such as SVD, which provably solves the inference problem for the model with bounded $l_1$ error. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic-specific Catchwords, group…
Taxonomy
Topics: Topic Modeling · Tensor decomposition and applications · Algorithms and Data Compression
Methods: Latent Dirichlet Allocation
