Semantic Component Analysis: Introducing Multi-Topic Distributions to Clustering-Based Topic Modeling

Florian Eichin, Carolin M. Schuster, Georg Groh, and Michael A. Hedderich

arXiv:2410.21054 · cs.CL · September 29, 2025

TL;DR

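Semantic Component Analysis (SCA) is a scalable topic modeling method that discovers multiple topics per document. On Twitter datasets in English, Hausa, and Chinese, it achieves coherence and diversity competitive with BERTopic while uncovering at least twice as many topics, and it outperforms the LLM-based TopicGPT under similar compute budgets.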

Contribution

SCA introduces a decomposition step into clustering-based topic modeling, enabling multiple topics per sample and improved scalability.

Findings

01. Achieves coherence and diversity competitive with BERTopic.

02. Uncovers at least twice as many topics as BERTopic while keeping the noise rate close to zero.

03. Outperforms the LLM-based TopicGPT under similar compute budgets.

Abstract

Topic modeling is a key method in text analysis, but existing approaches fail to efficiently scale to large datasets or are limited by assuming one topic per document. Overcoming these limitations, we introduce Semantic Component Analysis (SCA), a topic modeling technique that discovers multiple topics per sample by introducing a decomposition step to the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. There, it achieves competitive coherence and diversity compared to BERTopic, while uncovering at least double the topics and maintaining a noise rate close to zero. We also find that SCA outperforms the LLM-based TopicGPT in scenarios with similar compute budgets. SCA thus provides an effective and efficient approach for topic modeling of large datasets.
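The abstract describes adding a decomposition step to clustering-based topic modeling so that a sample can carry more than one topic. The toy sketch below illustrates that general idea only and is not the authors' implementation (the official code is in the repository listed on this page). It makes several assumptions that are not taken from the paper: a small hand-written k-means stands in for the clustering step, random vectors stand in for sentence embeddings, and the names `kmeans`, `decompose`, `k`, and `rounds` are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means used as a stand-in clustering step (assumption)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def decompose(embeddings, k=2, rounds=2, seed=0):
    """Cluster, turn each cluster centroid into a unit-length component,
    remove that direction from the cluster's members, then cluster the
    residuals again. Hypothetical illustration, not the paper's algorithm."""
    residual = embeddings.astype(float).copy()
    components = []
    for r in range(rounds):
        labels = kmeans(residual, k, seed=seed + r)
        for j in range(k):
            members = labels == j
            if not members.any():
                continue
            c = residual[members].mean(0)
            norm = np.linalg.norm(c)
            if norm < 1e-12:
                continue  # skip degenerate centroids
            c /= norm
            components.append(c)
            # Decomposition step: subtract each member's projection onto c.
            residual[members] -= np.outer(residual[members] @ c, c)
    C = np.stack(components)
    # Score every document against every component, so one document
    # can be associated with several components at once.
    scores = embeddings @ C.T
    return C, scores

# Random vectors stand in for document embeddings (assumption).
X = np.random.default_rng(0).normal(size=(60, 8))
components, scores = decompose(X, k=2, rounds=2)
# Count components whose score magnitude exceeds an arbitrary threshold.
topics_per_doc = (np.abs(scores) > 0.5).sum(axis=1)
```

Because later rounds cluster what is left after earlier components have been removed, a document can score highly on components found in different rounds, which is one way a clustering pipeline can yield several topics per sample. The threshold of 0.5 is arbitrary and chosen only for the toy data.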

Code & Models

Repositories

mainlp/semantic_components
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Semantic Component Analysis