Generalized Topic Modeling

Avrim Blum; Nika Haghtalab

arXiv:1611.01259 · cs.LG · November 7, 2016 · 2 cites

Open Access

TL;DR

This paper introduces a broad generalization of topic modeling in which words are no longer drawn i.i.d. and a topic is instead a complex distribution over sequences of paragraphs. Rather than explicitly modeling these distributions, it focuses on directly learning a document classifier.

Contribution

It proposes a new topic modeling framework in which topics are distributions over sequences of paragraphs and the learner directly predicts a document's topic mixture, extending traditional models.

Findings

01

Efficient algorithms under natural conditions for the generalized model.

02

Analysis of noise tolerance and sample complexity.

03

Connection to multi-view and co-training frameworks.

Abstract

Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution $a_{i}$ over words, and a document is generated by first selecting a mixture $w$ over topics, and then generating words i.i.d. from the associated mixture $Aw$. Given a large collection of such documents, the goal is to recover the topic vectors and then to correctly classify new documents according to their topic mixture. In this work we consider a broad generalization of this framework in which words are no longer assumed to be drawn i.i.d. and instead a topic is a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature…
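The standard generative process that the abstract describes (select a mixture $w$ over topics, then draw words i.i.d. from $Aw$) can be sketched as below. This is only an illustration of the baseline model, not code from the paper and not the generalized model it proposes. The function name `generate_document`, the parameter `alpha`, and the choice of a symmetric Dirichlet distribution for $w$ are assumptions made here; the abstract does not say how $w$ is selected.

```python
import numpy as np


def generate_document(A, alpha, n_words, rng):
    """Sample one document from a standard (i.i.d.-word) topic model.

    A       : (vocab_size, k) matrix; column i is the topic distribution a_i.
    alpha   : symmetric Dirichlet parameter for the mixture w
              (an assumption; the source does not specify a prior).
    n_words : number of words to draw.
    rng     : a numpy random Generator.
    """
    vocab_size, k = A.shape
    # Step 1: select a mixture w over the k topics.
    w = rng.dirichlet(alpha * np.ones(k))
    # Step 2: the document's word distribution is the mixture A w.
    word_dist = A @ w
    # Step 3: draw word indices i.i.d. from A w.
    words = rng.choice(vocab_size, size=n_words, p=word_dist)
    return w, words


# Hypothetical toy setup: 3 topics over a 6-word vocabulary.
rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(6), size=3).T  # each column sums to 1
w, words = generate_document(A, alpha=0.5, n_words=20, rng=rng)
```

Here `words` holds vocabulary indices and `w` is the hidden topic mixture that a classifier would aim to predict for a new document.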

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies