Sampled Weighted Min-Hashing for Large-Scale Topic Mining

Gibran Fuentes-Pineda; Ivan Vladimir Meza-Ruiz

arXiv:1509.01771·cs.LG·September 9, 2015

TL;DR

SWMH is a novel randomized method for large-scale topic mining that produces ordered vocabulary subsets, capturing themes at various granularities, and is evaluated on multiple large corpora.

Contribution

Introduces Sampled Weighted Min-Hashing, a randomized approach for scalable topic mining that represents each topic as an ordered subset of the vocabulary, and compares the resulting topics with Online LDA topics for document representation in classification.

Findings

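01 Evaluated on corpora ranging from NIPS (1.7 K documents) and 20 Newsgroups (20 K) to Reuters (800 K) and Wikipedia (4 M)

02 Produces topics that reflect corpus themes at different levels of granularity

03 Compared with Online LDA topics as document representations for classification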

Abstract

We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification.
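The two steps named in the abstract, randomly partitioning the vocabulary by term co-occurrence and then agglomerating overlapping cells, can be illustrated with a small Python sketch. It is not the authors' implementation: it uses plain (unweighted, unsampled) min-hashing over each term's document set, a greedy merge, and the overlap coefficient as the similarity test, all of which are assumptions made for illustration. The names `minhash_partitions`, `agglomerate`, `num_tables`, `tuple_size`, and `min_overlap` are hypothetical.

```python
import random
from collections import defaultdict


def minhash_partitions(term_docs, num_tables=50, tuple_size=2, seed=0):
    """Partition the vocabulary several times with min-hash tuples.

    term_docs maps each term to the non-empty set of document ids containing it.
    Terms whose min-hash tuples collide in a table fall into the same cell.
    Returns all cells with more than one term, across all tables.
    """
    rng = random.Random(seed)
    doc_ids = sorted(set().union(*term_docs.values()))
    cells = []
    for _ in range(num_tables):
        # One random permutation of the documents per tuple position.
        ranks = []
        for _ in range(tuple_size):
            perm = doc_ids[:]
            rng.shuffle(perm)
            ranks.append({doc: pos for pos, doc in enumerate(perm)})
        table = defaultdict(set)
        for term, docs in term_docs.items():
            # Min-hash value: smallest permuted rank among the term's documents.
            key = tuple(min(rank[d] for d in docs) for rank in ranks)
            table[key].add(term)
        cells.extend(frozenset(c) for c in table.values() if len(c) > 1)
    return cells


def agglomerate(cells, min_overlap=0.7):
    """Greedily merge cells whose overlap coefficient reaches min_overlap.

    Assumption: overlap coefficient = |A & B| / min(|A|, |B|).
    """
    clusters = []
    for cell in cells:
        for cluster in clusters:
            overlap = len(cell & cluster) / min(len(cell), len(cluster))
            if overlap >= min_overlap:
                cluster |= cell
                break
        else:
            clusters.append(set(cell))
    return clusters
```

On a toy inverted index, terms that appear in the same documents end up in the same mined set:

```python
term_docs = {
    "neural": {0, 1, 2}, "network": {0, 1, 2},
    "stock": {3, 4, 5}, "market": {3, 4, 5},
}
topics = agglomerate(minhash_partitions(term_docs, num_tables=10, seed=1))
```

The sketch leaves out the term weighting, the sampling, and the ordering of terms within a topic that SWMH adds on top of this basic scheme.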

Code & Models

Repositories

gibranfp/Sampled-MinHashing

Taxonomy

Methods: Latent Dirichlet Allocation