Disentangling Similarity and Relatedness in Topic Models
Hanlin Xiao, Mauricio A. Álvarez, Rainer Breitling

TL;DR
This paper introduces a method for distinguishing similarity from relatedness in topic models. It combines a synthetic word-pair benchmark with a neural scoring function to show how different model families capture semantic structure and how this relates to downstream task performance.
Contribution
The paper presents an approach for disentangling similarity and relatedness in topic models, together with a benchmark and evaluation pipeline for assessing which kind of semantic structure a model captures.
Findings
Different model families capture distinct semantic structures.
Similarity and relatedness scores predict downstream task performance.
The proposed benchmark enables systematic evaluation of semantic axes.
Abstract
The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model…
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Sentiment Analysis and Opinion Mining
