A provable SVD-based algorithm for learning topics in dominant admixture corpus

Trapit Bansal; Chiranjib Bhattacharyya; Ravindran Kannan

arXiv:1410.6991·stat.ML·November 5, 2014·36 cites

Open Access

TL;DR

This paper introduces a simple SVD-based algorithm with thresholding for learning topics in dominant admixture corpora, under realistic assumptions verified empirically, achieving provable accuracy and improved sample complexity.

Contribution

It proposes a novel, provable SVD-based method for topic inference under the realistic catchwords assumption, with near-optimal sample complexity and empirical validation.

Findings

1. The algorithm outperforms state-of-the-art methods on real data.

2. It provably recovers topics with bounded $l_1$ error.

3. Its sample complexity is near optimal with respect to $w_0$.
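
To make the $l_1$ error in the findings concrete, here is a minimal sketch of how it is computed between a true topic vector and an estimate. The function name `l1_error` and the three-word toy vectors are illustrative assumptions, not taken from the paper.

```python
def l1_error(estimated, true):
    """Return the l1 distance: the sum of absolute coordinate-wise differences."""
    if len(estimated) != len(true):
        raise ValueError("vectors must have the same length")
    return sum(abs(e - t) for e, t in zip(estimated, true))


# Hypothetical topic vectors over a three-word vocabulary (each sums to 1).
true_topic = [0.5, 0.3, 0.2]
estimated_topic = [0.4, 0.4, 0.2]

error = l1_error(estimated_topic, true_topic)
print(f"l1 error: {error:.3f}")
```

For two probability vectors this distance always lies between 0 and 2, which is why it is a natural scale on which to state a recovery guarantee.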

Abstract

Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from admixtures is NP-hard. Assuming separability, a strong assumption, [4] gave the first provable algorithm for inference. For the LDA model, [6] gave a provable algorithm using tensor methods. But [4,6] do not learn topic vectors with bounded $l_{1}$ error (a natural measure for probability vectors). Our aim is to develop a model which makes intuitive and empirically supported assumptions and to design an algorithm with natural, simple components such as SVD, which provably solves the inference problem for the model with bounded $l_{1}$ error. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic-specific Catchwords, group…
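
The generative picture in the abstract's first sentence, in which a document is drawn from an admixture of topic distributions, can be sketched as follows. This illustrates the model only, not the paper's inference algorithm; the two toy topics, the weights `[0.8, 0.2]`, the four-word vocabulary, and the helper name `sample_document` are all assumptions made for the example.

```python
import random


def sample_document(topics, weights, n_words, rng):
    """Draw n_words word indices from the admixture sum_k weights[k] * topics[k]."""
    vocab_size = len(topics[0])
    # Mix the topic distributions into one word distribution for this document.
    mixture = [
        sum(w * topic[i] for w, topic in zip(weights, topics))
        for i in range(vocab_size)
    ]
    words = rng.choices(range(vocab_size), weights=mixture, k=n_words)
    return words, mixture


# Two hypothetical topics over a four-word vocabulary (each row sums to 1).
topics = [
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.3, 0.6],
]
# Hypothetical topic weights for one document; the first topic carries most of the mass.
weights = [0.8, 0.2]

rng = random.Random(0)  # fixed seed so the draw is reproducible
words, mixture = sample_document(topics, weights, n_words=50, rng=rng)
print(mixture)
```

Inference runs in the opposite direction: given only the sampled words of many such documents, the goal is to recover the rows of `topics`.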

Taxonomy

Topics: Topic Modeling · Tensor decomposition and applications · Algorithms and Data Compression

Methods: Latent Dirichlet Allocation