Generalized Topic Modeling
Avrim Blum, Nika Haghtalab

TL;DR
This paper introduces a broad generalization of topic modeling in which words are no longer assumed to be drawn i.i.d. and a topic is instead a complex distribution over sequences of paragraphs. It focuses on directly learning a document classifier rather than explicitly modeling these distributions.
Contribution
It proposes a topic modeling framework in which a topic is a distribution over sequences of paragraphs rather than over i.i.d. words, and in which the learner directly predicts a document's topic mixture instead of recovering the topic distributions.
Findings
Efficient algorithms under natural conditions for the generalized model.
Analysis of noise tolerance and sample complexity.
Connection to multi-view and co-training frameworks.
Abstract
Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution over words, and a document is generated by first selecting a mixture over topics, and then generating words i.i.d. from the associated mixture. Given a large collection of such documents, the goal is to recover the topic vectors and then to correctly classify new documents according to their topic mixture. In this work we consider a broad generalization of this framework in which words are no longer assumed to be drawn i.i.d. and instead a topic is a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature…
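For illustration only, the standard generative process described in the abstract (pick a mixture over topics, then draw words i.i.d. from the mixed word distribution) can be sketched as below. This covers the standard model, not the generalized one the paper studies. The vocabulary, the topic probabilities, and the helper names `sample_mixture`, `word_distribution`, and `generate_document` are hypothetical and do not come from the paper.

```python
import random

# Hypothetical toy vocabulary and topics; none of these numbers come from the paper.
VOCAB = ["goal", "team", "market", "stock", "vote", "senate"]
TOPICS = {
    "sports":   [0.45, 0.45, 0.04, 0.02, 0.02, 0.02],
    "business": [0.02, 0.08, 0.45, 0.41, 0.02, 0.02],
    "politics": [0.02, 0.04, 0.04, 0.02, 0.44, 0.44],
}


def sample_mixture(rng, k):
    """Draw topic weights on the simplex by normalizing exponential draws."""
    weights = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def word_distribution(mixture):
    """Mix the per-topic word distributions using the document's topic weights."""
    rows = list(TOPICS.values())
    return [
        sum(m * row[j] for m, row in zip(mixture, rows))
        for j in range(len(VOCAB))
    ]


def generate_document(rng, n_words):
    """Standard model: choose a topic mixture, then draw words i.i.d. from it."""
    mixture = sample_mixture(rng, len(TOPICS))
    probs = word_distribution(mixture)
    words = rng.choices(VOCAB, weights=probs, k=n_words)
    return mixture, words


rng = random.Random(0)
mixture, words = generate_document(rng, 12)
print(dict(zip(TOPICS, (round(m, 3) for m in mixture))))
print(" ".join(words))
```

In this toy setting the classification goal amounts to estimating `mixture` for a new document from `words` alone.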
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
