Holistic Evaluations of Topic Models

Thomas Compton

arXiv:2507.23364 · cs.IR · August 1, 2025

Open Access

TL;DR

This paper evaluates the performance and interpretability of topic models, specifically BERTopic, by analyzing 1140 runs to understand trade-offs in parameter optimization and implications for responsible use.

Contribution

It provides a comprehensive database-driven analysis of topic model evaluations, highlighting key trade-offs and interpretability issues.

Findings

01. Identified optimal parameter settings for BERTopic

02. Revealed trade-offs between model complexity and interpretability

03. Provided insights into responsible application of topic models

Abstract

Topic models are attracting growing commercial and academic interest for their ability to summarize large volumes of unstructured text. As unsupervised machine learning methods, they enable researchers to explore data and help general users understand key themes in large text collections. However, they risk becoming a 'black box', where users input data and accept the output as an accurate summary without scrutiny. This article evaluates topic models from a database perspective, drawing insights from 1140 BERTopic model runs. The goal is to identify trade-offs in optimizing model parameters and to reflect on what these findings mean for the interpretation and responsible use of topic models.

Taxonomy

Topics: Computational and Text Analysis Methods · Data Quality and Management · Expert finding and Q&A systems