Probabilistic Latent Semantic Analysis
Thomas Hofmann

TL;DR
Probabilistic Latent Semantic Analysis (PLSA) is a statistically grounded, mixture-based approach to analyzing co-occurrence data. It improves on linear-algebra methods such as SVD-based Latent Semantic Analysis in information retrieval and NLP tasks.
Contribution
The paper presents a probabilistic model for latent semantic analysis that replaces the Singular Value Decomposition of co-occurrence tables with a mixture decomposition derived from a latent class model.
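For orientation, the mixture decomposition can be written out as follows. The symbols are notation assumed here, since the summary itself does not define any: d is a document, w a word, z a latent class, and N the document-word count table. The first line is the truncated SVD used by standard Latent Semantic Analysis, the second is the latent class mixture, and the third restates the mixture in matrix form to make the parallel visible.

```latex
% Standard LSA: rank-K truncated SVD of the co-occurrence table N
N \approx U_K \, \Sigma_K \, V_K^{\top}

% PLSA: mixture decomposition over latent classes z
P(d, w) = \sum_{z} P(z) \, P(d \mid z) \, P(w \mid z)

% The same mixture in matrix form, with
% \hat{U} = [P(d_i \mid z_k)], \hat{\Sigma} = \mathrm{diag}(P(z_k)), \hat{V} = [P(w_j \mid z_k)]
P = \hat{U} \, \hat{\Sigma} \, \hat{V}^{\top}
```

Unlike the SVD factors, every factor in the mixture is a normalized, non-negative probability distribution, which is what gives the decomposition its statistical interpretation.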
Findings
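Yields substantial and consistent improvements over standard Latent Semantic Analysis in a number of experiments
Provides a more principled framework with a solid foundation in statistics
Avoids overfitting through tempered EM, a generalization of maximum likelihood model fitting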
Abstract
Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.
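A minimal Python sketch of how the mixture decomposition might be fitted with tempered EM, assuming NumPy and a dense document-word count matrix. The function names (`plsa_tempered_em`, `joint_distribution`), the fixed inverse temperature `beta`, and the toy data are illustrative assumptions, not the paper's reference implementation. In particular, the sketch keeps `beta` constant, whereas a full tempered EM procedure would typically tune it against held-out data.

```python
import numpy as np


def plsa_tempered_em(counts, n_topics, beta=0.9, n_iter=50, seed=0):
    """Fit P(d, w) = sum_z P(z) P(d|z) P(w|z) with a tempered E-step.

    counts: (n_docs, n_words) array of non-negative co-occurrence counts.
    beta:   inverse temperature (assumed fixed here); beta = 1 is ordinary EM.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs, n_words = counts.shape
    rng = np.random.default_rng(seed)

    # Random, strictly positive, normalized initialization.
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs)) + 1e-3
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words)) + 1e-3
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    tiny = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # Tempered E-step: P(z|d,w) is proportional to P(z) [P(d|z) P(w|z)]^beta.
        joint = p_z[:, None, None] * (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
        post = joint / np.maximum(joint.sum(axis=0, keepdims=True), tiny)

        # M-step: re-estimate each distribution from expected counts.
        weighted = counts[None, :, :] * post  # shape (n_topics, n_docs, n_words)
        p_d_z = weighted.sum(axis=2)
        p_w_z = weighted.sum(axis=1)
        p_z = weighted.sum(axis=(1, 2))
        p_d_z /= np.maximum(p_d_z.sum(axis=1, keepdims=True), tiny)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), tiny)
        p_z /= np.maximum(p_z.sum(), tiny)

    return p_z, p_d_z, p_w_z


def joint_distribution(p_z, p_d_z, p_w_z):
    """Reconstruct the (n_docs, n_words) matrix of P(d, w) from the factors."""
    return np.einsum("k,kd,kw->dw", p_z, p_d_z, p_w_z)


if __name__ == "__main__":
    # Toy co-occurrence table (made-up data): two groups of documents
    # that mostly use disjoint halves of the vocabulary.
    toy_counts = np.array([
        [4, 3, 2, 0, 0, 0],
        [3, 4, 1, 0, 0, 0],
        [0, 0, 0, 2, 5, 3],
        [0, 0, 1, 4, 3, 4],
    ])
    p_z, p_d_z, p_w_z = plsa_tempered_em(toy_counts, n_topics=2)
    print(joint_distribution(p_z, p_d_z, p_w_z).round(3))
```

Setting `beta` below 1 flattens the posterior over latent classes in the E-step, which is the mechanism the abstract credits with avoiding overfitting. The dense three-dimensional posterior array is only practical for small tables; a sparse formulation would be needed for realistic corpora.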
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Advanced Text Analysis Techniques · Natural Language Processing Techniques
