Inference and Evaluation of the Multinomial Mixture Model for Text Clustering

Loïs Rigouste (TSI); Olivier Cappé (TSI); François Yvon (TSI)

arXiv:cs/0606069·cs.IR·August 16, 2016

TL;DR

This paper studies a probabilistic multinomial mixture model for text clustering. It compares inference methods such as EM and Gibbs sampling, and proposes evaluation criteria and heuristics to improve clustering accuracy in high-dimensional settings.

Contribution

It introduces a systematic evaluation framework for text clustering, compares EM and Gibbs sampling inference methods, and proposes heuristics for high-dimensional parameter estimation.

Findings

01 Gibbs sampling outperforms EM in high-dimensional scenarios.

02 Initialization and feature choices significantly affect clustering results.

03 Heuristic vocabulary reduction improves inference efficiency.

Abstract

In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. The model considered in this contribution consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We present and contrast various estimation procedures, which apply both in supervised and unsupervised contexts. In supervised learning, this work suggests a criterion for evaluating the posterior odds of new documents which is more statistically sound than the "naive Bayes" approach. In an unsupervised context, we propose measures to set up a systematic evaluation framework and start with examining the Expectation-Maximization (EM)…
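
As a rough illustration of the model described in the abstract, the sketch below fits a mixture of multinomials to a document-by-word count matrix with EM. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `em_multinomial_mixture`, the additive `smoothing` constant, the optional `init_resp` argument, and the toy count matrix are all hypothetical choices made for this example. The multinomial coefficient is omitted in the E-step because it is the same for every component and cancels in the posterior.

```python
import numpy as np


def em_multinomial_mixture(counts, n_components, n_iter=50,
                           smoothing=1e-2, init_resp=None, seed=0):
    """EM for a mixture of multinomials (illustrative sketch).

    counts:    (n_docs, n_words) array of word counts.
    init_resp: optional (n_docs, n_components) initial soft assignments;
               if None, they are drawn at random (an assumption of this sketch).
    Returns mixture weights alpha, word distributions beta, responsibilities.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs, _ = counts.shape
    if init_resp is None:
        rng = np.random.default_rng(seed)
        resp = rng.dirichlet(np.ones(n_components), size=n_docs)
    else:
        resp = np.asarray(init_resp, dtype=float)

    for _ in range(n_iter):
        # M-step: re-estimate mixture weights and per-theme word probabilities.
        # The smoothing term keeps every probability strictly positive.
        alpha = resp.sum(axis=0) + smoothing
        alpha /= alpha.sum()
        beta = resp.T @ counts + smoothing          # (n_components, n_words)
        beta /= beta.sum(axis=1, keepdims=True)

        # E-step: posterior over themes for each document, in the log domain.
        log_post = np.log(alpha) + counts @ np.log(beta).T
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

    return alpha, beta, resp


# Toy data (hypothetical): two documents use words 0-1, two use words 2-3.
counts = np.array([[5, 4, 0, 0],
                   [6, 5, 0, 0],
                   [0, 0, 5, 6],
                   [0, 0, 4, 5]])
# A slightly tilted starting point makes the run deterministic.
init = np.array([[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]])
alpha, beta, resp = em_multinomial_mixture(counts, 2, init_resp=init)
labels = resp.argmax(axis=1)
print(labels)
```

Passing `init_resp` explicitly is a design choice of this sketch: EM only finds a local optimum, so the outcome depends on the starting point, and fixing it makes the toy run reproducible.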
