Enhancing BERTopic with Intermediate Layer Representations

Dominik Koterwa; Maciej Świtała

arXiv:2505.06696 · cs.CL · May 13, 2025

TL;DR

This paper evaluates 18 embedding representations for BERTopic on three datasets and shows that alternative configurations can outperform the default setting in topic coherence and topic diversity.

Contribution

The study systematically compares various intermediate layer embeddings for BERTopic, revealing optimal configurations and the impact of stop words on topic modeling performance.
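To make "intermediate layer embeddings" concrete, here is a minimal sketch of turning the per-token hidden states of one chosen layer into a single document vector. This listing does not enumerate the paper's 18 configurations, so the function name `pool_layer`, the three pooling strategies, and the toy numbers are illustrative assumptions, not the authors' setup. In practice the per-layer token vectors would typically come from a transformer that exposes its hidden states (for example, Hugging Face models called with `output_hidden_states=True`).

```python
from typing import List, Sequence

Vector = List[float]


def pool_layer(
    hidden_states: Sequence[Sequence[Sequence[float]]],
    attention_mask: Sequence[int],
    layer: int = -1,
    strategy: str = "mean",
) -> Vector:
    """Pool the token vectors of one layer into a single document vector.

    hidden_states[layer][token] is a token vector; attention_mask[token]
    is 1 for real tokens and 0 for padding. All names are hypothetical.
    """
    # Keep only the vectors of non-padding tokens from the chosen layer.
    tokens = [vec for vec, keep in zip(hidden_states[layer], attention_mask) if keep]
    if not tokens:
        raise ValueError("attention_mask excludes every token")
    if strategy == "cls":
        # First unmasked token; with right-padding this is the [CLS] position.
        return list(tokens[0])
    dims = range(len(tokens[0]))
    if strategy == "mean":
        return [sum(t[d] for t in tokens) / len(tokens) for d in dims]
    if strategy == "max":
        return [max(t[d] for t in tokens) for d in dims]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy input: 2 layers, 3 tokens (the last one is padding), 2 dimensions.
toy_hidden_states = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[0.0, 2.0], [4.0, 6.0], [9.0, 9.0]],
]
toy_mask = [1, 1, 0]

last_layer_mean = pool_layer(toy_hidden_states, toy_mask, layer=-1, strategy="mean")
first_layer_max = pool_layer(toy_hidden_states, toy_mask, layer=0, strategy="max")
```

Changing `layer` and `strategy` yields different document embeddings from the same model, which is the kind of variation the comparison is about.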

Findings

01. Certain embedding configurations outperform default BERTopic settings.

02. Stop words influence the quality of topic representations.

03. Performance varies across different datasets.

Abstract

BERTopic is a topic modeling algorithm that leverages transformer-based embeddings to create dense clusters, enabling the estimation of topic structures and the extraction of valuable insights from a corpus of documents. This approach allows users to efficiently process large-scale text data and gain meaningful insights into its structure. While BERTopic is a powerful tool, embedding preparation can vary, including extracting representations from intermediate model layers and applying transformations to these embeddings. In this study, we evaluate 18 different embedding representations and present findings based on experiments conducted on three diverse datasets. To assess the algorithm's performance, we report topic coherence and topic diversity metrics across all experiments. Our results demonstrate that, for each dataset, it is possible to find an embedding configuration that…
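The truncated abstract does not define topic diversity. A commonly used definition is the share of unique words among the top-k words of all topics, and the sketch below assumes that definition. The function name `topic_diversity` and the example topics are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    Assumed definition: 1.0 means no word is shared between topics,
    values near 0 mean the topics largely repeat the same words.
    """
    # Flatten the top_k words of each topic into one list.
    top_words: List[str] = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words to evaluate")
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, each given as a ranked list of words.
example_topics = [
    ["game", "team", "season"],
    ["game", "player", "coach"],
]
example_score = topic_diversity(example_topics, top_k=3)  # 5 unique words out of 6
```

Topic coherence, the other reported metric, needs word co-occurrence statistics from a reference corpus and is not sketched here.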

Code & Models

Repositories

dkoterwa/optimizing_bertopic
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies