Assigning Topics to Documents by Successive Projections

Olga Klopp (CREST); Maxim Panov (Skoltech); Suzanne Sigalla (CREST); Alexandre Tsybakov (CREST)

arXiv:2107.03684 · math.ST · July 9, 2021

TL;DR

This paper introduces SPOC, a fast and simple algorithm for assigning topics to documents. It comes with theoretical guarantees, and in synthetic-data experiments its error grows more slowly with dictionary size than that of traditional methods such as LDA.

Contribution

The paper presents the SPOC algorithm for topic assignment, providing theoretical performance bounds and a new method for estimating the number of topics.

Findings

01. The SPOC algorithm is computationally fast and easy to implement.

02. Theoretical bounds show near-optimal performance of SPOC.

03. The error grows logarithmically with dictionary size, outperforming LDA.

Abstract

Topic models provide a useful tool to organize and understand the structure of large corpora of text documents, in particular, to discover hidden thematic structure. Clustering documents from big unstructured corpora into topics is an important task in various areas, such as image analysis, e-commerce, social networks, and population genetics. A common approach to topic modeling is to associate each topic with a probability distribution on the dictionary of words and to consider each document as a mixture of topics. Since the number of topics is typically substantially smaller than the size of the corpus and of the dictionary, the methods of topic modeling can lead to a dramatic dimension reduction. In this paper, we study the problem of estimating the topic distribution of each document in a given corpus, that is, we focus on the clustering aspect of the problem. We introduce an algorithm…
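The mixture model described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's SPOC algorithm; it only generates data from a model of the kind the abstract describes, in which each topic is a probability distribution over the dictionary and each document is a mixture of topics. The function name `simulate_corpus`, the Dirichlet priors, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_corpus(n_docs=200, n_topics=3, vocab_size=1000, doc_length=150, seed=0):
    """Draw a toy corpus from a topic mixture model (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    # Each row of `topics` is a probability distribution over the dictionary.
    topics = rng.dirichlet(np.full(vocab_size, 0.1), size=n_topics)
    # Each row of `weights` is one document's mixture over the topics.
    weights = rng.dirichlet(np.full(n_topics, 0.5), size=n_docs)
    # Expected word frequencies per document: a matrix of rank at most n_topics.
    word_probs = weights @ topics
    # Guard against floating-point drift before sampling.
    word_probs /= word_probs.sum(axis=1, keepdims=True)
    # Observed word counts: one multinomial draw per document.
    counts = np.stack([rng.multinomial(doc_length, p) for p in word_probs])
    return topics, weights, word_probs, counts

if __name__ == "__main__":
    topics, weights, word_probs, counts = simulate_corpus()
    print("documents x words:", counts.shape)
    print("rank of expected frequency matrix:", np.linalg.matrix_rank(word_probs))
```

Under these assumptions the 200 x 1000 matrix of expected word frequencies factors into a 200 x 3 weight matrix and a 3 x 1000 topic matrix, which is the dimension reduction the abstract refers to. Estimating the rows of `weights` from `counts` is the task the paper addresses.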
