TL;DR
This paper introduces a Min-Hashing based method for discovering topics in massive text corpora that is scalable, does not require predefining the number of topics, and produces coherent sets of highly co-occurring words.
Contribution
The paper presents a novel Min-Hashing approach for topic discovery that handles large datasets efficiently without fixing the number of topics beforehand.
Findings
Effective in discovering meaningful topics across various large corpora
Linear time complexity with respect to corpus and vocabulary size
Able to process the entire Wikipedia in under 7 hours
Abstract
The task of discovering topics in text corpora has been dominated by Latent Dirichlet Allocation and other Topic Models for over a decade. To apply these approaches to massive text corpora, the vocabulary needs to be reduced considerably, and large computer clusters and/or GPUs are typically required. Moreover, the number of topics must be provided beforehand, but this depends on the corpus characteristics and is often difficult to estimate, especially for massive text corpora. Unfortunately, both topic quality and time complexity are sensitive to this choice. This paper describes an alternative approach to discovering topics based on Min-Hashing, which can handle massive text corpora and large vocabularies using modest computer hardware and does not require fixing the number of topics in advance. The basic idea is to generate multiple random partitions of the corpus vocabulary…
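The abstract is cut off before the method is fully described, so the following is only a generic Min-Hashing sketch and not the paper's actual procedure. It assumes each word is represented by the set of documents it occurs in, and that words whose min-hash tuples collide are grouped together, so that each hash table induces one random partition of the vocabulary. The function name `minhash_cooccurring_words` and the parameters `num_tuples` and `tuple_size` are hypothetical, chosen for illustration.

```python
import random
from collections import defaultdict


def minhash_cooccurring_words(docs, num_tuples=50, tuple_size=2, seed=0):
    """Group words whose document sets collide under Min-Hashing.

    Illustrative only: names and defaults are assumptions, not the paper's.
    """
    rng = random.Random(seed)

    # Inverted index: word -> set of ids of the documents that contain it.
    occurrences = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for word in set(doc.lower().split()):
            occurrences[word].add(doc_id)

    doc_ids = list(range(len(docs)))
    groups = set()
    for _ in range(num_tuples):
        # Each random permutation of document ids defines one min-hash
        # function; rank[d] is the position of document d in the permutation.
        ranks = []
        for _ in range(tuple_size):
            rank = doc_ids[:]
            rng.shuffle(rank)
            ranks.append(rank)

        # Words with the same tuple of min-hash values share a bucket, so
        # the buckets of one table form a partition of the vocabulary.
        buckets = defaultdict(set)
        for word, ids in occurrences.items():
            key = tuple(min(rank[d] for d in ids) for rank in ranks)
            buckets[key].add(word)

        # Keep only cells holding at least two words.
        for cell in buckets.values():
            if len(cell) > 1:
                groups.add(frozenset(cell))
    return groups


corpus = ["cat dog", "cat dog", "sun moon", "sun moon"]
for group in minhash_cooccurring_words(corpus):
    print(sorted(group))
```

Two words collide under a single min-hash function with probability equal to the Jaccard similarity of their document sets, so words that frequently appear in the same documents tend to land in the same cell. Words with disjoint document sets never collide. How the paper turns such partitions into topics is not recoverable from the truncated abstract.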
