PRISM: PRIor from corpus Statistics for topic Modeling

Tal Ishon; Yoav Goldberg; Uri Shaham

arXiv:2603.29406 · cs.LG · April 1, 2026


TL;DR

PRISM is a corpus-intrinsic method for initializing LDA that improves topic coherence and interpretability without external knowledge, making it suitable for resource-limited domains.

Contribution

The paper introduces a corpus-based Dirichlet parameter, derived from word co-occurrence statistics, that improves LDA initialization without modifying LDA's core generative process.

Findings

01

PRISM improves topic coherence and interpretability in text and RNA-seq data.

02

PRISM rivals models that rely on external knowledge, such as pre-trained embeddings.

03

Code is available at https://github.com/shaham-lab/PRISM.

Abstract

Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.
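
To make the idea of a co-occurrence-derived Dirichlet parameter more concrete, the sketch below builds a topic-by-word prior matrix from document-level co-occurrence counts. It is an illustration only and is not the procedure from the paper, which this page does not spell out. The seed-word heuristic, the function names (`cooccurrence_counts`, `pair_count`, `build_prior`), and the `base` and `strength` parameters are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Return per-word document frequency and per-pair document co-occurrence."""
    df = Counter()
    co = Counter()
    for doc in docs:
        words = sorted(set(doc))
        df.update(words)
        # Each unordered pair is counted once per document, stored as (a, b) with a < b.
        co.update(combinations(words, 2))
    return df, co


def pair_count(co, a, b):
    """Symmetric lookup of the co-occurrence count for words a and b."""
    if a == b:
        return 0
    return co[(a, b)] if a < b else co[(b, a)]


def build_prior(docs, num_topics, base=0.01, strength=1.0):
    """Hypothetical heuristic, NOT the paper's method.

    Picks one seed word per topic, then sets each topic's Dirichlet row to
    base + strength * (co-occurrence with the seed / seed document frequency).
    """
    df, co = cooccurrence_counts(docs)
    vocab = sorted(df)
    # Start from the most frequent word, then greedily add the word that
    # co-occurs least with the seeds chosen so far (ties: more frequent first).
    seeds = [max(vocab, key=lambda w: (df[w], w))]
    while len(seeds) < num_topics:
        candidates = [w for w in vocab if w not in seeds]
        seeds.append(min(
            candidates,
            key=lambda w: (sum(pair_count(co, w, s) for s in seeds), -df[w], w),
        ))
    eta = []
    for s in seeds:
        row = [
            base + strength * (df[w] if w == s else pair_count(co, w, s)) / df[s]
            for w in vocab
        ]
        eta.append(row)
    return vocab, seeds, eta


# Toy corpus (made up for illustration).
docs = [
    ["gene", "cell", "rna"],
    ["gene", "cell", "expression"],
    ["topic", "model", "prior"],
    ["topic", "model", "corpus"],
]
vocab, seeds, eta = build_prior(docs, num_topics=2)
for seed, row in zip(seeds, eta):
    print(seed, dict(zip(vocab, row)))
```

Every entry of the resulting matrix is strictly positive, so each row is a valid Dirichlet parameter. A matrix like this could then be passed to an LDA implementation that accepts a per-topic word prior (for example, the `eta` argument of gensim's `LdaModel` takes a matrix of shape `(num_topics, num_words)`), provided the column order matches that implementation's vocabulary ids. The generative process itself is left untouched, in line with the abstract.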


Code & Models

Repositories

shaham-lab/PRISM (GitHub): https://github.com/shaham-lab/PRISM
