Topic Discovery in Massive Text Corpora Based on Min-Hashing

Gibran Fuentes-Pineda; Ivan Vladimir Meza-Ruiz

arXiv:1807.00938·cs.CL·August 8, 2019

TL;DR

This paper introduces a Min-Hashing based method for discovering topics in massive text corpora that is scalable, does not require predefining the number of topics, and produces coherent sets of highly co-occurring words.

Contribution

The paper presents a novel Min-Hashing approach for topic discovery that handles large datasets efficiently without fixing the number of topics beforehand.

Findings

01 Effective in discovering meaningful topics across various large corpora

02 Linear time complexity with respect to corpus and vocabulary size

03 Able to process entire Wikipedia in under 7 hours

Abstract

The task of discovering topics in text corpora has been dominated by Latent Dirichlet Allocation and other Topic Models for over a decade. To apply these approaches to massive text corpora, the vocabulary needs to be reduced considerably, and large computer clusters and/or GPUs are typically required. Moreover, the number of topics must be provided beforehand, but this number depends on the corpus characteristics and is often difficult to estimate, especially for massive text corpora. Unfortunately, both topic quality and time complexity are sensitive to this choice. This paper describes an alternative approach to discovering topics based on Min-Hashing, which can handle massive text corpora and large vocabularies using modest computer hardware and does not require fixing the number of topics in advance. The basic idea is to generate multiple random partitions of the corpus vocabulary…
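
As a rough illustration of how Min-Hashing can induce random partitions of a vocabulary, the sketch below represents each word by the set of documents it occurs in and groups words whose sets yield the same tuple of min-hash values. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `minhash_partition`, the parameters `tuple_size` and `seed`, the toy inverted index, and the use of explicit random permutations as hash functions are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def minhash_partition(word_docs, num_docs, tuple_size, seed):
    """Partition words by the tuple of min-hash values of their document sets.

    Hypothetical helper for illustration only; not taken from the paper.
    word_docs maps each word to the non-empty set of document ids it occurs in.
    """
    rng = random.Random(seed)
    # Each random permutation of the document ids plays the role of one
    # min-hash function (an assumption; real systems use cheaper hashes).
    perms = []
    for _ in range(tuple_size):
        perm = list(range(num_docs))
        rng.shuffle(perm)
        perms.append(perm)

    cells = defaultdict(list)
    for word, docs in sorted(word_docs.items()):
        # Min-hash value under one permutation: smallest permuted id in the set.
        key = tuple(min(perm[d] for d in docs) for perm in perms)
        cells[key].append(word)
    return list(cells.values())


# Toy inverted index (made-up data): word -> ids of documents containing it.
word_docs = {
    "goal": {0, 1, 2},
    "match": {0, 1, 2},
    "league": {0, 1},
    "stock": {3, 4, 5},
    "market": {3, 4, 5},
    "bank": {4, 5},
}

# Several independent random partitions of the same vocabulary.
partitions = [minhash_partition(word_docs, 6, 2, seed) for seed in range(3)]
for cells in partitions:
    print(cells)
```

In this sketch, words with identical document sets always land in the same cell and words with disjoint document sets never do. Words with partially overlapping sets collide with a probability that grows with the Jaccard similarity of their sets, which is why repeating the partitioning with different seeds tends to surface groups of highly co-occurring words.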
