Semantic Component Analysis: Introducing Multi-Topic Distributions to Clustering-Based Topic Modeling
Florian Eichin, Carolin M. Schuster, Georg Groh, and Michael A. Hedderich

TL;DR
Semantic Component Analysis (SCA) is a scalable topic modeling method that discovers multiple topics per document. On large Twitter datasets in English, Hausa, and Chinese, it is competitive with BERTopic in coherence and diversity while uncovering at least twice as many topics.
Contribution
SCA introduces a decomposition step into clustering-based topic modeling, enabling multiple topics per sample and improved scalability.
Findings
Achieves competitive coherence and diversity compared to BERTopic.
Uncovers at least twice as many topics as BERTopic while keeping the noise rate close to zero.
Outperforms LLM-based TopicGPT under similar compute budgets.
Abstract
Topic modeling is a key method in text analysis, but existing approaches fail to efficiently scale to large datasets or are limited by assuming one topic per document. Overcoming these limitations, we introduce Semantic Component Analysis (SCA), a topic modeling technique that discovers multiple topics per sample by introducing a decomposition step to the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. There, it achieves competitive coherence and diversity compared to BERTopic, while uncovering at least double the topics and maintaining a noise rate close to zero. We also find that SCA outperforms the LLM-based TopicGPT in scenarios with similar compute budgets. SCA thus provides an effective and efficient approach for topic modeling of large datasets.
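The idea of adding a decomposition step to clustering-based topic modeling can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the abstract does not specify how the decomposition works, so the sketch assumes each cluster is summarized by its normalized centroid (a "component") and that decomposition means subtracting each embedding's projection onto that component before clustering again. The names `cluster_fn`, `toy_cluster`, `decompose`, and `topic_scores`, as well as the 0.5 threshold, are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def remove_component(v, component):
    """Subtract the projection of v onto a unit-length component."""
    scale = dot(v, component)
    return [a - scale * c for a, c in zip(v, component)]

def decompose(embeddings, cluster_fn, n_rounds=3):
    """Alternate clustering and projection removal (assumed scheme).

    cluster_fn is a hypothetical callable that returns one integer label
    per vector, with -1 meaning noise.
    """
    residuals = [list(e) for e in embeddings]
    components = []
    for _ in range(n_rounds):
        labels = cluster_fn(residuals)
        found = []
        for label in sorted(set(labels) - {-1}):
            members = [r for r, l in zip(residuals, labels) if l == label]
            found.append(normalize(centroid(members)))
        if not found:
            break  # nothing left to cluster
        components.extend(found)
        for comp in found:
            residuals = [remove_component(r, comp) for r in residuals]
    return components

def topic_scores(embedding, components):
    """Cosine similarity between one embedding and every component."""
    unit = normalize(embedding)
    return [dot(unit, c) for c in components]

def toy_cluster(vectors, eps=1e-9):
    """Stand-in clusterer: label = index of the largest absolute coordinate."""
    labels = []
    for v in vectors:
        if math.sqrt(dot(v, v)) < eps:
            labels.append(-1)
        else:
            labels.append(max(range(len(v)), key=lambda i: abs(v[i])))
    return labels

docs = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
components = decompose(docs, toy_cluster)
scores = topic_scores([1.0, 1.0, 0.0], components)
# A document can be assigned every component whose score passes the threshold.
topics = [i for i, s in enumerate(scores) if s > 0.5]
```

In this toy setup the mixed document `[1.0, 1.0, 0.0]` scores equally on both components and so receives two topics, which is the multi-topic behavior the abstract describes. A real pipeline would replace `toy_cluster` with an actual clustering algorithm over sentence embeddings.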
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Semantic Component Analysis
