TL;DR
PRISM is a corpus-intrinsic method for initializing LDA. It improves topic coherence and interpretability without external knowledge, which makes it suitable for resource-limited domains.
Contribution
PRISM derives a Dirichlet parameter from the corpus's word co-occurrence statistics and uses it to initialize LDA, without modifying LDA's generative process.
Findings
PRISM improves topic coherence and interpretability on text and single-cell RNA-seq data.
It rivals models that rely on external knowledge, such as pre-trained embeddings.
Code is available at https://github.com/shaham-lab/PRISM.
Abstract
Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.
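The abstract does not spell out how the Dirichlet parameter is computed, so the following is only an illustrative sketch of the general idea of a co-occurrence-derived topic-word prior, not the PRISM algorithm itself. The function name `cooccurrence_prior`, the parameters `base` and `strength`, and the greedy seed-word heuristic are all assumptions made for this example; the authors' actual procedure is in their repository.

```python
import numpy as np

def cooccurrence_prior(doc_term, num_topics, base=0.01, strength=1.0):
    """Hypothetical helper: build a (num_topics, vocab_size) Dirichlet prior
    from document-level word co-occurrence. Not the authors' method."""
    # Binary presence matrix: 1 if the word appears in the document.
    X = (np.asarray(doc_term) > 0).astype(float)
    # cooc[i, j] = number of documents containing both word i and word j
    # (the diagonal holds each word's document frequency).
    cooc = X.T @ X
    vocab_size = cooc.shape[0]

    # Greedy seed selection (an assumption of this sketch): start from the
    # most connected word, then down-weight words that co-occur with it.
    score = cooc.sum(axis=1)
    chosen = np.zeros(vocab_size, dtype=bool)
    seeds = []
    for _ in range(num_topics):
        s = int(np.argmax(np.where(chosen, -np.inf, score)))
        chosen[s] = True
        seeds.append(s)
        score = score * (1.0 - cooc[s] / max(cooc[s].max(), 1e-12))

    # Each topic's prior is a small uniform mass plus the seed word's
    # row-normalized co-occurrence profile.
    profiles = cooc[seeds] / np.maximum(cooc[seeds].sum(axis=1, keepdims=True), 1e-12)
    eta = base + strength * profiles
    return eta, seeds

# Toy corpus: words 0-2 only co-occur with each other, as do words 3-5.
doc_term = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
])
eta, seeds = cooccurrence_prior(doc_term, num_topics=2)
```

A matrix of this shape could then be passed as the topic-word prior to an LDA implementation that accepts one; gensim's `LdaModel`, for instance, documents an `eta` argument that may be a `(num_topics, num_terms)` matrix. Because only the prior changes, the LDA generative process itself is left untouched, which matches the property the abstract emphasizes.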
