Inference and Evaluation of the Multinomial Mixture Model for Text Clustering
Loïs Rigouste (TSI), Olivier Cappé (TSI), François Yvon (TSI)

TL;DR
This paper explores probabilistic multinomial mixture models for text clustering, comparing inference methods like EM and Gibbs sampling, and proposing evaluation criteria and heuristics to improve clustering accuracy in high-dimensional settings.
Contribution
It introduces a systematic evaluation framework for text clustering, compares EM and Gibbs sampling inference methods, and proposes heuristics for high-dimensional parameter estimation.
Findings
Gibbs sampling outperforms EM in high-dimensional scenarios.
Initialization and feature choices significantly affect clustering results.
Heuristic vocabulary reduction improves inference efficiency.
Abstract
In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. The model considered in this contribution consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We present and contrast various estimation procedures, which apply both in supervised and unsupervised contexts. In supervised learning, this work suggests a criterion for evaluating the posterior odds of new documents which is more statistically sound than the "naive Bayes" approach. In an unsupervised context, we propose measures to set up a systematic evaluation framework and start with examining the Expectation-Maximization (EM)…
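As an illustration of the model the abstract describes, the following is a minimal sketch of EM for a mixture of multinomials over word counts. It is not the authors' implementation: the function name `em_multinomial_mixture`, the random soft-assignment initialization, and the additive smoothing constant are all assumptions chosen for brevity. The multinomial coefficient is omitted from the E-step because it is the same for every component and cancels when the posteriors are normalized.

```python
import numpy as np

def em_multinomial_mixture(counts, n_components, n_iter=50, smoothing=1.0, seed=0):
    """Hypothetical EM sketch for a multinomial mixture.

    counts: (n_docs, n_words) array of word counts.
    Returns mixture weights alpha (K,), word distributions beta (K, n_words),
    and posterior responsibilities resp (n_docs, K).
    """
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Assumed initialization: random soft assignments of documents to themes.
    resp = rng.dirichlet(np.ones(n_components), size=n_docs)

    for _ in range(n_iter):
        # M-step: re-estimate weights and per-theme word probabilities.
        # The additive smoothing (an assumption) keeps every probability > 0.
        alpha = (resp.sum(axis=0) + smoothing) / (n_docs + n_components * smoothing)
        word_totals = resp.T @ counts + smoothing
        beta = word_totals / word_totals.sum(axis=1, keepdims=True)

        # E-step in the log domain to avoid underflow on long documents.
        log_post = np.log(alpha) + counts @ np.log(beta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

    return alpha, beta, resp

# Toy corpus (made up for illustration): 6 documents over a 4-word vocabulary.
toy_counts = np.array([
    [5, 3, 0, 0],
    [5, 3, 0, 0],
    [4, 4, 1, 0],
    [0, 0, 6, 2],
    [0, 1, 3, 5],
    [0, 0, 4, 4],
])
alpha, beta, resp = em_multinomial_mixture(toy_counts, n_components=2)
labels = resp.argmax(axis=1)  # hard cluster assignment per document
```

Because EM only reaches a local optimum, different values of `seed` can yield different clusterings on real collections, which is consistent with the summary's point that initialization affects the results.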
