BERTopic: Neural topic modeling with a class-based TF-IDF procedure
Maarten Grootendorst

TL;DR
BERTopic introduces a neural topic modeling approach that combines transformer-based embeddings, clustering, and a class-based TF-IDF method to produce coherent and competitive topics from document collections.
Contribution
It extends clustering-based topic modeling with a class-based variation of TF-IDF that extracts coherent topic representations from clusters of documents.
Findings
Generates coherent topics
Remains competitive across a variety of benchmarks against both classical models and recent clustering-based approaches
Combines transformer embeddings with clustering and TF-IDF
Abstract
Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach to topic modeling.
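The last step of the pipeline, the class-based TF-IDF, can be illustrated with a minimal sketch. It assumes the embedding and clustering steps have already assigned a cluster label to each document, and it uses the weighting tf(t, c) · log(1 + A / tf(t)), where A is the average number of words per class. This formula follows the paper's definition as best understood and is not spelled out in the abstract, so treat it as an assumption. The function names, whitespace tokenization, toy documents, and labels are all hypothetical.

```python
import math
from collections import Counter


def class_tfidf(docs, labels):
    """Score terms per cluster with tf(t, c) * log(1 + A / tf(t)).

    Assumption: this weighting mirrors the class-based TF-IDF described
    in the paper; it is not the library's implementation.
    """
    # Treat all documents in a cluster as one concatenated "class document".
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # tf(t): frequency of each term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    return {
        label: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_terms(scores, n=2):
    """Return the n highest-scoring terms for each cluster."""
    return {
        label: sorted(term_scores, key=term_scores.get, reverse=True)[:n]
        for label, term_scores in scores.items()
    }


# Toy corpus with hypothetical cluster assignments (0 = animals, 1 = finance).
docs = [
    "cat chased the mouse",
    "dog chased the cat",
    "stock market fell",
    "the market rallied as stock prices rose",
]
labels = [0, 0, 1, 1]

scores = class_tfidf(docs, labels)
top = top_terms(scores)
print(top)
```

Because the weighting is computed per cluster rather than per document, a term such as "the" that appears in every cluster is scored lower than a term concentrated in a single cluster, even when both occur equally often inside that cluster.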
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Attention Dropout · Residual Connection · Linear Warmup With Linear Decay · Dense Connections · Weight Decay
