Enhancing BERTopic with Intermediate Layer Representations
Dominik Koterwa, Maciej Świtała

TL;DR
This paper evaluates 18 embedding representations for BERTopic on three diverse datasets, showing that alternative configurations can outperform the default settings in topic coherence and topic diversity.
Contribution
The study systematically compares various intermediate layer embeddings for BERTopic, revealing optimal configurations and the impact of stop words on topic modeling performance.
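To make the idea of an "intermediate layer embedding" concrete, the sketch below pools the token vectors of one chosen layer into a single document vector. It is a minimal illustration, not the authors' pipeline: the function name `pool_layer`, the nested-list layout of `hidden_states` (layers, then tokens, then dimensions), and the three strategies shown are assumptions made for this example, and the 18 configurations studied in the paper are not enumerated here.

```python
def pool_layer(hidden_states, attention_mask, layer=-1, strategy="mean"):
    """Pool one layer's token vectors into a single document embedding.

    hidden_states: list of layers, each a list of token vectors (hypothetical layout).
    attention_mask: 1 for real tokens, 0 for padding.
    """
    # Keep only the vectors of non-padding tokens from the selected layer.
    tokens = [vec for vec, keep in zip(hidden_states[layer], attention_mask) if keep]
    if not tokens:
        raise ValueError("attention_mask excludes every token")

    if strategy == "cls":
        # Assumes the first unmasked token is the [CLS]-style summary token.
        return list(tokens[0])

    columns = list(zip(*tokens))  # transpose: one tuple per embedding dimension
    if strategy == "mean":
        return [sum(col) / len(col) for col in columns]
    if strategy == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy input: 2 layers, 3 tokens (the last one is padding), 2 dimensions.
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[0.0, 1.0], [2.0, 5.0], [7.0, 7.0]],
]
attention_mask = [1, 1, 0]

mean_first = pool_layer(hidden_states, attention_mask, layer=0, strategy="mean")
max_last = pool_layer(hidden_states, attention_mask, layer=-1, strategy="max")
cls_last = pool_layer(hidden_states, attention_mask, layer=-1, strategy="cls")
```

In a real setup the per-layer token vectors would come from a transformer model, and the pooled vectors would be passed to BERTopic as precomputed document embeddings.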
Findings
Certain embedding configurations outperform default BERTopic settings.
Stop words influence the quality of topic representations.
Performance varies across different datasets.
Abstract
BERTopic is a topic modeling algorithm that leverages transformer-based embeddings to create dense clusters, enabling the estimation of topic structures and the extraction of valuable insights from a corpus of documents. This approach allows users to efficiently process large-scale text data and gain meaningful insights into its structure. While BERTopic is a powerful tool, embedding preparation can vary, including extracting representations from intermediate model layers and applying transformations to these embeddings. In this study, we evaluate 18 different embedding representations and present findings based on experiments conducted on three diverse datasets. To assess the algorithm's performance, we report topic coherence and topic diversity metrics across all experiments. Our results demonstrate that, for each dataset, it is possible to find an embedding configuration that…
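As an illustration of the topic diversity metric mentioned above, the sketch below uses one common definition: the fraction of unique words among the top-k words of all topics. Whether the paper uses exactly this variant is an assumption, and the function name `topic_diversity` and the toy topics are hypothetical.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    topics: list of topics, each a list of words ordered by importance.
    A value of 1.0 means no word is shared between topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words supplied")
    return len(set(top_words)) / len(top_words)


# Toy example: "game" and "price" each appear in two topics.
topics = [
    ["game", "team", "season"],
    ["market", "stock", "price"],
    ["game", "price", "player"],
]
score = topic_diversity(topics, top_k=3)  # 7 unique words out of 9
```

A score close to 1 suggests topics with little overlap, while a low score indicates redundant topics that share many of their top words.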
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies
