PRISM: LLM-Guided Semantic Clustering for High-Precision Topics
Connor Douglas, Utkucan Balci, Joseph Aylett-Bullock

TL;DR
PRISM is a framework that combines large language models with semantic clustering to improve topic discovery and separation in text corpora while staying efficient and interpretable.
Contribution
It introduces a student-teacher pipeline for distilling LLM supervision into lightweight models and demonstrates effective web-scale text analysis.
Findings
PRISM outperforms state-of-the-art local topic models, and clustering on large frontier embedding models, in topic separability.
Requires only a small number of LLM queries for training.
Enables interpretable, locally deployable web-scale text analysis.
Abstract
In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representations captured by LLMs with the low cost and interpretability of latent semantic clustering methods. PRISM fine-tunes a sentence encoding model using a sparse set of LLM-provided labels on samples drawn from a corpus of interest. We segment this embedding space with thresholded clustering, yielding clusters that separate closely related topics within a narrow domain. Across multiple corpora, PRISM improves topic separability over state-of-the-art local topic models and even over clustering on large, frontier embedding models while requiring only a small number of LLM queries to train. This work contributes to several research streams by providing (i) a student-teacher pipeline to distill sparse LLM supervision into a lightweight…
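
To make the "thresholded clustering" step in the abstract concrete, here is a minimal sketch in pure Python. The abstract does not say which clustering algorithm or threshold PRISM uses, so everything here is an assumption made for illustration: the function names (cosine, threshold_cluster), the single-linkage rule (two texts share a cluster if they are connected by a chain of pairs whose cosine similarity meets the threshold), the threshold value, and the toy 2-D vectors, which stand in for embeddings from the fine-tuned sentence encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def threshold_cluster(embeddings, threshold=0.9):
    """Hypothetical thresholded clustering (not necessarily PRISM's method).

    Links every pair whose cosine similarity is >= threshold and returns
    one integer label per input, one label per connected component.
    """
    n = len(embeddings)
    parent = list(range(n))  # union-find forest over input indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy stand-ins for sentence embeddings: two tight pairs and one outlier.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [0.7, 0.7]]
labels = threshold_cluster(toy, threshold=0.95)
print(labels)  # [0, 0, 1, 1, 2]
```

Raising the threshold splits the space into tighter clusters and lowering it merges them; with threshold=0.7 all five toy vectors fall into a single cluster. The pairwise loop is quadratic in the number of texts, so a sketch like this suits small samples only.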
