Necessary and Sufficient Conditions and a Provably Efficient Algorithm for Separable Topic Discovery
Weicong Ding, Prakash Ishwar, Venkatesh Saligrama

TL;DR
This paper gives necessary and sufficient conditions for topic discovery in separable topic models and proposes a provably efficient, distributed algorithm that exploits the geometry of a normalized word co-occurrence matrix.
Contribution
The paper provides a geometric framework and an efficient algorithm, with proven polynomial sample and computational complexity bounds, for identifying topics in separable models, improving both scalability and theoretical understanding.
Findings
Algorithm is provably consistent and efficient
Achieves polynomial sample and computational complexity
Suitable for distributed, web-scale data mining
Abstract
We develop necessary and sufficient conditions and a novel provably consistent and efficient algorithm for discovering topics (latent factors) from observations (documents) that are realized from a probabilistic mixture of shared latent factors that have certain properties. Our focus is on the class of topic models in which each shared latent factor contains a novel word that is unique to that factor, a property that has come to be known as separability. Our algorithm is based on the key insight that the novel words correspond to the extreme points of the convex hull formed by the row-vectors of a suitably normalized word co-occurrence matrix. We leverage this geometric insight to establish polynomial computation and sample complexity bounds based on a few isotropic random projections of the rows of the normalized word co-occurrence matrix. Our proposed random-projections-based…
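The geometric idea in the abstract can be illustrated with a small sketch: if the novel words are the extreme points of a convex hull of row vectors, then projecting the rows onto random isotropic directions and taking the maximizer along each direction should return only extreme points. The code below is an illustrative toy under stated assumptions, not the authors' algorithm: the function name `extreme_points_by_random_projection`, the parameter `num_projections`, and the synthetic data (three hull vertices plus random convex combinations of them) are all hypothetical, and the normalization of a real word co-occurrence matrix is not modeled.

```python
import numpy as np


def extreme_points_by_random_projection(rows, num_projections=200, seed=0):
    """Return indices of rows that maximize at least one random projection.

    Hypothetical helper for illustration: each row is treated as a point,
    and the maximizer of a random linear functional over a finite point set
    is an extreme point of its convex hull.
    """
    rng = np.random.default_rng(seed)
    # Isotropic Gaussian directions, one per projection.
    directions = rng.standard_normal((num_projections, rows.shape[1]))
    # projections[i, j] = <rows[i], directions[j]>
    projections = rows @ directions.T
    # For each direction, record which row attains the maximum.
    winners = np.argmax(projections, axis=0)
    return np.unique(winners)


# Toy data (an assumption, not from the paper): rows 0-2 are the hull
# vertices, and the remaining rows are strict convex combinations of them.
vertices = np.eye(3)
data_rng = np.random.default_rng(1)
weights = data_rng.dirichlet(np.ones(3), size=50)
rows = np.vstack([vertices, weights @ vertices])

found = extreme_points_by_random_projection(rows)
print(found)  # indices of rows detected as extreme points
```

In this toy setting the interior rows can never win a projection, because a convex combination of the vertices cannot exceed the best vertex along any direction. How many projections are needed, and how noise in an empirical co-occurrence matrix affects the result, are the questions the paper's complexity bounds address.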
Taxonomy
Topics: Data Management and Algorithms · Data Mining Algorithms and Applications · Complex Network Analysis Techniques
