TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction
Aoi Fujita, Taichi Yamamoto, Yuri Nakayama, and Ryota Kobayashi

TL;DR
TopiCLEAR is a clustering-based method that extracts interpretable topics from short social media texts by combining SBERT embeddings, adaptive dimensionality reduction, and iterative refinement; it outperforms the baseline methods it was tested against.
Contribution
The paper introduces an unsupervised topic extraction technique that operates directly on raw text, avoiding preprocessing while improving interpretability and accuracy on social media data.
Findings
Achieves highest similarity to human-labeled topics among tested methods.
Significantly outperforms baseline models on diverse datasets.
Produces more interpretable and meaningful topics.
Abstract
Rapid expansion of social media platforms such as X (formerly Twitter), Facebook, and Reddit has enabled large-scale analysis of public perceptions on diverse topics, including social issues, politics, natural disasters, and consumer sentiment. Topic modeling is a widely used approach for uncovering latent themes in text data, typically framed as an unsupervised classification task. However, traditional models, originally designed for longer and more formal documents, struggle with short social media posts due to limited co-occurrence statistics, fragmented semantics, inconsistent spelling, and informal language. To address these challenges, we propose a new method, TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction. Specifically, each text is embedded using Sentence-BERT (SBERT) and provisionally clustered using Gaussian Mixture Models (GMM). The…
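The first two steps named in the abstract (SBERT embedding, then provisional clustering with a GMM) can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function names `embed_texts` and `provisional_clusters`, the checkpoint `all-MiniLM-L6-v2`, and the diagonal covariance setting are all assumptions, and the later adaptive dimensional reduction and refinement steps are not shown because the abstract is cut off before describing them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Embed raw texts with SBERT.

    Assumption: the checkpoint name is illustrative; the source does not
    say which SBERT model is used. Requires the sentence-transformers package.
    """
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(list(texts)))


def provisional_clusters(embeddings, n_topics, seed=0):
    """Assign each embedding a provisional topic label with a GMM.

    Assumption: diagonal covariances are chosen here only to keep the fit
    cheap in high dimensions; the source does not specify this setting.
    """
    gmm = GaussianMixture(
        n_components=n_topics,
        covariance_type="diag",
        random_state=seed,
    )
    return gmm.fit_predict(np.asarray(embeddings))
```

With the sentence-transformers package installed, `provisional_clusters(embed_texts(posts), n_topics=5)` would return one provisional topic label per post, where `posts` is a hypothetical list of raw strings and the topic count is chosen by the user.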
Taxonomy
Topics: Computational and Text Analysis Methods · Sentiment Analysis and Opinion Mining · Complex Network Analysis Techniques
